Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 201412

Summary: Repeated launch through Automated Services Framework eventually fails
Product: z_Archived Reporter: Kevin Mooney <kmooney>
Component: TPTPAssignee: Stanislav Polevic <stanislav.v.polevic>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: P1 CC: dmorris, igor.alelekov, jcayne, jgwest, jkubasta, martin.hrdlicka, paulslau
Version: unspecifiedKeywords: plan
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Whiteboard: closed460
Bug Depends on:    
Bug Blocks: 221278    
Attachments:
Description Flags
AC debug log zip file
none
zip of process explorer output
none
Close process input/output named pipes
none
Close thread handles
none
Fix broken NativeFileServer.c none

Description Kevin Mooney CLA 2007-08-28 08:08:52 EDT
Build ID: N/A

Steps To Reproduce:
1.Install consuming product RPT
2.Create script to repeatedly launch command line schedule execution
3.Fails with hang usually around 70th iteration


More information:
It is also possible to launch with -Xdebug such that the debugger can be attached when the hang occurs.  At this level it appeared the launchTestExecution sessionConnect() in TestExecutionHarness.java was in a loop over connectThread.join(1000).  The connectThread ("Hyadess Session Connect") is in mutableSession.set(node.connect(specifier)) and never returns.
Comment 1 Paul Slauenwhite CLA 2007-10-11 13:59:50 EDT
Joe, please provide a sizing.
Comment 2 Paul Slauenwhite CLA 2007-11-05 14:44:37 EST
Please provide a sizing for this defect before Monday, November 12, 2007.
Comment 3 Paul Slauenwhite CLA 2007-11-07 06:33:14 EST
Bulk update of the target field to reflect the change in iteration numbers.
Comment 4 DuWayne Morris CLA 2007-11-28 12:43:56 EST
Added a wild guess for sizing and moved to i5.
Comment 5 Paul Slauenwhite CLA 2008-01-07 11:44:56 EST
Deferring to I6.
Comment 6 Paul Slauenwhite CLA 2008-01-07 13:11:12 EST
Required in I5 by a consuming product.
Comment 7 Paul Slauenwhite CLA 2008-02-13 09:23:51 EST
Deferring to I6 since dependent defect is not complete.

This will need to be done as the first I6 work item.
Comment 8 DuWayne Morris CLA 2008-02-15 13:22:03 EST
Unable to reproduce kicking off runs under the debugger.  Didn't see anything obvious from code inspection other than the obvious need to add a timeout to "cure" the loop in connectThread.join(1000).

With profiler instrumented code, finally reproduced the problem. So, rather than a true hang, there is an infinite loop in the workbench in org.eclipse.hyades.internal.execution.local.control.  The AC ends up in a bad state unless re-started (even after killing zombie Java processes and a hung workbench).  Thus, after failure, the problem will repeat.  Currently tracking down the reason for the infinite loop and whether the AC caused the failure or vice-versa.

Updated hours worked. Obviously, my wild guess estimate was light.
Comment 9 DuWayne Morris CLA 2008-02-24 22:07:52 EST
From comment #8, there was no actual issue with the infinite loop in the workbench.  That was actually the loop processing incoming messages from the AC.

In tracking through on the workbench, the issue is that the session JVM gets created, then a message comes back from the AC with the process ID.  The workbench is then waiting for a message that the session JVM is up and in the READY state.  This message never comes in.  So, the problem seems to be the communications link between the freshly created session JVM and the AC.  The entire sequence is repeatable until the AC is restarted.

The ending method calls are:

16:47:09.430 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,init <main>
16:47:09.430 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,handleCommands <main>

Upon restarting the AC, all works well until the next time the AC gets into this bad state.  All the expected message traffic flows properly.

I did find an issue with the AC that is a show-stopper for robustness.  There appears to be a rather severe growth in handles being used.  In about an hour, the number of handles was observered to reach 3789 , at which time the AC reached the "bad" state where the execution hangs. Windows documentation states that the maximum number of handles per process is 10000.  Thus, if this handle leak is not the root cause of this failure, it will become an unacceptable  limiting point.  Further data to be provided on Monday, Feb 25.
Comment 10 DuWayne Morris CLA 2008-02-25 08:51:18 EST
Ran again last night, the results are the same.  The number of open handles after around 70 test executions was 3799.  Again, the session JVM seems to finish initialzation, enters the HandleCommands method waiting for messages and that is where the action stops, no messages are received.

23:28:41.271 PROBEDATA - entering org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive <Thread-2>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,initialize <main>
23:28:41.271 PROBEDATA - exiting org/eclipse/hyades/execution/remote/SessionImpl,init <main>
23:28:41.271 PROBEDATA - entering org/eclipse/hyades/execution/remote/SessionImpl,handleCommands <main>

Updated hours.
Comment 11 DuWayne Morris CLA 2008-02-25 08:54:13 EST
Ran again last night, the results are the same.  The number of open handles
after around 70 test executions was 3799.  Again, the session JVM seems to
finish initialzation, enters the HandleCommands method waiting for messages and
that is where the action stops, no messages are received.

23:28:41.271 PROBEDATA - entering
org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive
<Thread-2>
23:28:41.271 PROBEDATA - entering
org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive
<Thread-2>
23:28:41.271 PROBEDATA - entering
org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting
org/eclipse/hyades/execution/remote/SessionImpl,access$0 <Thread-2>
23:28:41.271 PROBEDATA - exiting
org/eclipse/hyades/execution/remote/SessionImpl$1,agentControllerActive
<Thread-2>
23:28:41.271 PROBEDATA - exiting
org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,agentControllerActive
<Thread-2>
23:28:41.271 PROBEDATA - exiting
org/eclipse/hyades/internal/execution/remote/RemoteComponentSkeleton,initialize
<main>
23:28:41.271 PROBEDATA - exiting
org/eclipse/hyades/execution/remote/SessionImpl,init <main>
23:28:41.271 PROBEDATA - entering
org/eclipse/hyades/execution/remote/SessionImpl,handleCommands <main>

Comment 12 DuWayne Morris CLA 2008-02-25 09:12:02 EST
Checked the operation of TPTP 4.5i5 driver for the handle leak issue.  The behavior is similar.  The number of open handles in RAServer.exe grows for each test exectution.  Running a simple test 4 times from the GUI caused the number of open handles to grow from 90 to 179.  The difference is that the handle leak is closer to 50 per test execution in 4.2.2 versus around 22 per execution on 4.5 branch.  This testing was done on Windows XP.
Comment 13 DuWayne Morris CLA 2008-02-25 09:41:06 EST
FYI,

Here is the link to the microsoft knowledge base article that states the 10,000 handle limit per process on Windows XP:

http://support.microsoft.com/kb/327699
Comment 14 DuWayne Morris CLA 2008-02-25 12:34:06 EST
Ran an ant task with TPTP 4.5i5 to run a single URL test many times through ASF.  After approximately 31 test executions, the AC reached the failure condition where execution was hung and no further tests could be executed. Re-starting the AC "cured" the problem as expected.

In TPTP 4.5, the number of open handles reached about 1080 before the hang failure (rather than 3800).  Thus, the handle leak now seems likely to be a separate issue from the primary failure mechanism.  Both issues need to be fixed in TPTP 4.5 and TPTP 4.2.2.
Comment 15 Igor Alelekov CLA 2008-02-26 11:44:45 EST
Could you please submit log file with DEBUG level for TPTP 4.5m5 AC?
Comment 16 DuWayne Morris CLA 2008-02-26 13:07:19 EST
Created attachment 90772 [details]
AC debug log zip file

Hi,

Good point.  In testing with the original source of the problem in TPTP 4.2.2, I was unable to run with DEBUG turned-on. I was getting an AC process crash nearly right away in 4.2.2.  I'm attaching a zip of the debug log.
Comment 17 Jonathan West CLA 2008-02-27 13:24:21 EST
Hi DuWayne, once the AC enters its failing state, could you generate a list of the open handles for that AC? The best way to do this is to use the Process Explorer program available from Microsoft's SysInternals site. Once in Process Explorer, select View -> Lower Pane View -> Handles. A lower pane in the window will open with a list of handles. To export them to a file, click the handle text, and push CTRL-A... a window will pop up asking you to "Save Process Explorer Data". This text file contains the handle list.
Comment 18 DuWayne Morris CLA 2008-02-27 15:27:26 EST
Created attachment 90923 [details]
zip of process explorer output

Hi Johnathan,

I am attaching the zip you requested.  I didn't run to failure on this one.  Looks like the handle leak rate is a bit lower when running TPTP URL tests instead of the consumming product tests.  Not sure why.  In any case, this snap-shot is up around 1700 handles open.  It seems clear that the handles being returned by CreateThread calls are not being closed.  It appears there is a similar problem with CreateNamedPipe handles.  This is from a 4.2.2 agent running as a service.
Comment 19 DuWayne Morris CLA 2008-02-27 16:43:40 EST
Adding some info.  I ran TPTP URL tests against the 4.2.2 AC instead of consumming product tests.  It was something over 100 tests executed instead of around 70 before failure and the number of open handles was 3010.

I have opened a separate defect, https://bugs.eclipse.org/bugs/show_bug.cgi?id=220665 for TPTP 4.5 branch as was agreed in our conference call today.

Updated hours.

My action item remaining on this issue:

I will provide a patch in the next day or two that will provide a timeout error to the user instead of hanging forever when the AC is not responding.  This patch will also be checked in to TPTP 4.5 code and patched to the 220665 defect.

Comment 20 Stanislav Polevic CLA 2008-03-07 08:37:03 EST
Created attachment 91870 [details]
Close process input/output named pipes

Close stdio/stdin/stdout named pipes
Comment 21 Igor Alelekov CLA 2008-03-11 04:14:27 EDT
The patch looks good.
Applied to the 422 branch.
Comment 22 jkubasta CLA 2008-03-11 06:53:36 EDT
Joel, pls work to provide separate build as we have no way of refreshing 422 in the field.
Comment 23 Stanislav Polevic CLA 2008-03-11 07:02:10 EDT
Please, note that this patch does not cover all handle leaks, just named pipes.
Expect more patches to come up during this week.
Comment 24 Stanislav Polevic CLA 2008-04-10 08:27:50 EDT
Created attachment 95510 [details]
Close thread handles

The patch fixes thread handles leaks by closing each unused handles.
Made for current 4.5 trunk.
Comment 25 Stanislav Polevic CLA 2008-04-10 09:04:14 EDT
Created attachment 95518 [details]
Fix broken NativeFileServer.c

Fix mistakenly unsynchronized NativeFileServer.c.

4.5 trunk
Comment 26 Igor Alelekov CLA 2008-04-11 02:36:28 EDT
(In reply to comment #25)
> Created an attachment (id=95518) [details]
> Fix broken NativeFileServer.c
> Fix mistakenly unsynchronized NativeFileServer.c.
> 4.5 trunk

Applied to the HEAD.
Stas, for future patches, please update copyrights and place patch to a file with bug number in the name.
Comment 27 jkubasta CLA 2008-04-11 08:39:48 EDT
Resolving as fixed now that changes are committed to Head
Comment 28 Igor Alelekov CLA 2008-04-11 10:29:58 EDT
(In reply to comment #27)
> Resolving as fixed now that changes are committed to Head

Stas, please confirm if all leaking handles have been fixed.
Comment 29 jkubasta CLA 2008-08-19 11:01:39 EDT
Kevin, do you recall if the 4.2.2 fix was provided to you? Thanks.
Comment 30 Kevin Mooney CLA 2008-08-19 12:31:45 EDT
(In reply to comment #29)
> Kevin, do you recall if the 4.2.2 fix was provided to you? Thanks.
> 

Not to me, but I have not had involvement since submitting.
Comment 31 Paul Slauenwhite CLA 2009-06-30 12:11:22 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.