| Summary: | Repeated launch through Automated Services Framework eventually fails | | |
|---|---|---|---|
| Product: | z_Archived | Reporter: | Kevin Mooney <kmooney> |
| Component: | TPTP | Assignee: | Stanislav Polevic <stanislav.v.polevic> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P1 | CC: | dmorris, igor.alelekov, jcayne, jgwest, jkubasta, martin.hrdlicka, paulslau |
| Version: | unspecified | Keywords: | plan |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | closed460 | | |
| Bug Depends on: | | | |
| Bug Blocks: | 221278 | | |
| Attachments: | | | |
Joe, please provide a sizing. Please provide a sizing for this defect before Monday, November 12, 2007.

Bulk update of the target field to reflect the change in iteration numbers.

Added a wild guess for sizing and moved to i5.

Deferring to I6. Required in I5 by a consuming product. Deferring to I6 since the dependent defect is not complete. This will need to be done as the first I6 work item.

Unable to reproduce when kicking off runs under the debugger. Didn't see anything obvious from code inspection other than the need to add a timeout to "cure" the loop over connectThread.join(1000).

With profiler-instrumented code, finally reproduced the problem. Rather than a true hang, there is an infinite loop in the workbench in org.eclipse.hyades.internal.execution.local.control. The AC ends up in a bad state unless restarted (even after killing zombie Java processes and a hung workbench); thus, after a failure, the problem will repeat. Currently tracking down the reason for the infinite loop and whether the AC caused the failure or vice versa.

Updated hours worked. Obviously, my wild-guess estimate was light.

From comment #8, there was no actual issue with the infinite loop in the workbench; that was actually the loop processing incoming messages from the AC. Tracing through on the workbench, the issue is that the session JVM gets created, then a message comes back from the AC with the process ID. The workbench then waits for a message that the session JVM is up and in the READY state, and this message never arrives. So the problem seems to be the communications link between the freshly created session JVM and the AC. The entire sequence is repeatable until the AC is restarted. The ending method calls are:

16:47:09.430 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,init <main>
16:47:09.430 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,handleCommands <main>

Upon restarting the AC, all works well until the next time the AC gets into this bad state; all the expected message traffic flows properly.

I did find an issue with the AC that is a show-stopper for robustness: there is rather severe growth in the number of handles in use. In about an hour, the number of handles was observed to reach 3789, at which time the AC reached the "bad" state where execution hangs. Windows documentation states that the maximum number of handles per process is 10,000, so even if this handle leak is not the root cause of this failure, it will become an unacceptable limiting point. Further data to be provided on Monday, Feb 25.

Ran again last night; the results are the same. The number of open handles after around 70 test executions was 3799. Again, the session JVM seems to finish initialization, enters the handleCommands method waiting for messages, and that is where the action stops; no messages are received.
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,initialize <main>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,init <main>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,handleCommands <main>

Updated hours.

Checked the operation of the TPTP 4.5i5 driver for the handle leak issue. The behavior is similar: the number of open handles in RAServer.exe grows with each test execution. Running a simple test 4 times from the GUI caused the number of open handles to grow from 90 to 179. The difference is that the handle leak is closer to 50 per test execution in 4.2.2 versus around 22 per execution on the 4.5 branch. This testing was done on Windows XP.

FYI, here is the link to the Microsoft Knowledge Base article that states the 10,000-handle limit per process on Windows XP: http://support.microsoft.com/kb/327699

Ran an Ant task with TPTP 4.5i5 to run a single URL test many times through ASF. After approximately 31 test executions, the AC reached the failure condition where execution was hung and no further tests could be executed. Restarting the AC "cured" the problem as expected. In TPTP 4.5, the number of open handles reached about 1080 before the hang failure (rather than 3800), so the handle leak now seems likely to be a separate issue from the primary failure mechanism. Both issues need to be fixed in TPTP 4.5 and TPTP 4.2.2.

Could you please submit a log file with DEBUG level for the TPTP 4.5m5 AC?
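Since the investigation above leans heavily on watching the AC's open-handle count climb, here is a minimal sketch (not part of TPTP) of a stand-alone Win32 monitor that samples the handle count of a process such as RAServer.exe once per second. The program name and command-line handling are illustrative assumptions, and it assumes Windows XP SP1 or later for GetProcessHandleCount.

```c
#define _WIN32_WINNT 0x0501   /* GetProcessHandleCount needs XP SP1 or later */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    HANDLE process;
    DWORD pid;

    if (argc < 2) {
        fprintf(stderr, "usage: handlewatch <pid>\n");
        return 1;
    }

    pid = (DWORD)atoi(argv[1]);
    process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (process == NULL) {
        fprintf(stderr, "OpenProcess failed: %lu\n", (unsigned long)GetLastError());
        return 1;
    }

    for (;;) {
        DWORD handles = 0;
        if (!GetProcessHandleCount(process, &handles))
            break;                        /* target process has probably exited */
        printf("open handles: %lu\n", (unsigned long)handles);
        Sleep(1000);                      /* sample once per second */
    }

    CloseHandle(process);
    return 0;
}
```

Pointing a tool like this at the RAServer.exe PID during a run of the reproduction script would give the same kind of growth curve reported above without attaching Process Explorer.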
Created attachment 90772 [details]
AC debug log zip file
Hi,
Good point. In testing with the original source of the problem in TPTP 4.2.2, I was unable to run with DEBUG turned on; I was getting an AC process crash almost immediately in 4.2.2. I'm attaching a zip of the debug log.
Hi DuWayne, once the AC enters its failing state, could you generate a list of the open handles for that AC? The best way to do this is with the Process Explorer program available from Microsoft's SysInternals site. Once in Process Explorer, select View -> Lower Pane View -> Handles. A lower pane will open in the window with a list of handles. To export them to a file, click in the handle pane and press Ctrl-A; a window will pop up asking you to "Save Process Explorer Data". The resulting text file contains the handle list.

Created attachment 90923 [details]
zip of process explorer output
Hi Johnathan,
I am attaching the zip you requested. I didn't run to failure on this one. It looks like the handle leak rate is a bit lower when running TPTP URL tests instead of the consuming product's tests; not sure why. In any case, this snapshot is at around 1700 open handles. It seems clear that the handles returned by CreateThread calls are not being closed, and there appears to be a similar problem with CreateNamedPipe handles. This is from a 4.2.2 agent running as a service.
Adding some info. I ran TPTP URL tests against the 4.2.2 AC instead of the consuming product's tests. Something over 100 tests executed before failure (instead of around 70), and the number of open handles was 3010.

I have opened a separate defect, https://bugs.eclipse.org/bugs/show_bug.cgi?id=220665, for the TPTP 4.5 branch, as agreed in our conference call today.

Updated hours.

My remaining action item on this issue: I will provide a patch in the next day or two that gives the user a timeout error instead of hanging forever when the AC is not responding. This patch will also be checked in to the TPTP 4.5 code and attached to defect 220665.

Created attachment 91870 [details]
Close process input/output named pipes
Close stdio/stdin/stdout named pipes
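For context, here is a minimal sketch of the class of change this attachment describes (not the actual patch): every server-side handle the AC obtains from CreateNamedPipe for a launched process's redirected standard streams must eventually be passed to CloseHandle, otherwise each launch leaks one handle per pipe. The pipe name and function names below are illustrative assumptions, not taken from the AC source.

```c
#include <windows.h>

/* Illustrative only; not taken from the AC source. */
static void read_process_output(void)
{
    /* Server end of a named pipe, e.g. for a launched process's stdout. */
    HANDLE pipe = CreateNamedPipeA(
        "\\\\.\\pipe\\example_process_stdout",  /* illustrative pipe name */
        PIPE_ACCESS_INBOUND,                    /* this end only reads    */
        PIPE_TYPE_BYTE | PIPE_WAIT,
        1,                                      /* single instance        */
        4096, 4096,                             /* out/in buffer sizes    */
        0,                                      /* default timeout        */
        NULL);
    if (pipe == INVALID_HANDLE_VALUE)
        return;

    /* ... ConnectNamedPipe() and a ReadFile() loop while the process runs ... */

    /* The fix amounts to making sure these always run; without them every
     * test execution leaves another pipe handle open in the AC process.   */
    DisconnectNamedPipe(pipe);
    CloseHandle(pipe);
}

int main(void)
{
    read_process_output();
    return 0;
}
```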
The patch looks good. Applied to the 422 branch. Joel, please work to provide a separate build, as we have no way of refreshing 422 in the field.

Please note that this patch does not cover all handle leaks, just named pipes. Expect more patches during this week.

Created attachment 95510 [details]
Close thread handles
The patch fixes thread handle leaks by closing each unused handle.
Made for current 4.5 trunk.
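A minimal sketch of the pattern such a patch applies, assuming the leak is the common one where the HANDLE returned by CreateThread is simply dropped: the handle stays open (and counts toward the per-process quota) even after the thread exits, so if it is never used it should be closed right after creation. The function names are illustrative, not taken from the AC source.

```c
#include <windows.h>

/* Illustrative worker; not taken from the AC source. */
static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    /* ... per-request work ... */
    return 0;
}

static BOOL start_detached_worker(void)
{
    HANDLE thread = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    if (thread == NULL)
        return FALSE;

    /* Leak pattern: returning here without CloseHandle loses one handle per
     * call.  Closing the handle does not stop the thread; it only releases
     * the kernel object once the thread finishes.                          */
    CloseHandle(thread);
    return TRUE;
}

int main(void)
{
    start_detached_worker();
    Sleep(100);   /* give the detached worker a moment before exiting */
    return 0;
}
```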
Created attachment 95518 [details]
Fix broken NativeFileServer.c
Fixes NativeFileServer.c, which was mistakenly left unsynchronized.
4.5 trunk
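Since the actual NativeFileServer.c change is only available as the attachment, here is a generic sketch of the kind of synchronization it presumably restores: guarding state shared between server threads with a Win32 CRITICAL_SECTION. The variable and function names are illustrative assumptions, not taken from the TPTP source.

```c
#include <windows.h>

static CRITICAL_SECTION transferLock;   /* illustrative name       */
static long activeTransfers = 0;        /* example of shared state */

void file_server_init(void)
{
    InitializeCriticalSection(&transferLock);
}

void file_server_begin_transfer(void)
{
    /* Without the lock, concurrent server threads can race on the update. */
    EnterCriticalSection(&transferLock);
    activeTransfers++;
    LeaveCriticalSection(&transferLock);
}

void file_server_end_transfer(void)
{
    EnterCriticalSection(&transferLock);
    activeTransfers--;
    LeaveCriticalSection(&transferLock);
}

void file_server_shutdown(void)
{
    DeleteCriticalSection(&transferLock);
}
```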
(In reply to comment #25)
> Created an attachment (id=95518) [details]
> Fix broken NativeFileServer.c
> Fix mistakenly unsynchronized NativeFileServer.c.
> 4.5 trunk

Applied to HEAD. Stas, for future patches, please update the copyrights and place the patch in a file with the bug number in the name.

Resolving as fixed now that the changes are committed to HEAD.

(In reply to comment #27)
> Resolving as fixed now that changes are committed to Head

Stas, please confirm whether all leaking handles have been fixed.

Kevin, do you recall if the 4.2.2 fix was provided to you? Thanks.

(In reply to comment #29)
> Kevin, do you recall if the 4.2.2 fix was provided to you? Thanks.

Not to me, but I have not had involvement since submitting.

As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead, since it has been resolved and unverified for more than 1 year and is considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.
Build ID: N/A

Steps To Reproduce:
1. Install the consuming product RPT.
2. Create a script to repeatedly launch command-line schedule execution.
3. Execution fails with a hang, usually around the 70th iteration.

More information: It is also possible to launch with -Xdebug so that the debugger can be attached when the hang occurs. At this level it appeared that launchTestExecution's sessionConnect() in TestExecutionHarness.java was looping over connectThread.join(1000). The connectThread ("Hyades Session Connect") is in mutableSession.set(node.connect(specifier)) and never returns.