Bug 229189

Summary: Random intermittent errors while launching a test against remote agents
Product: z_Archived
Reporter: Mark D Dunn <mddunn>
Component: TPTP
Assignee: Jonathan West <jgwest>
Status: CLOSED FIXED
QA Contact:
Severity: blocker
Priority: P1
CC: dmorris, jgwest, jkubasta, jptoomey, paulslau, stanislav.v.polevic, stephen.francisco, xubing
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Windows XP
Whiteboard: closed460
Bug Depends on:
Bug Blocks: 222382, 229134

Attachments:
  serviceconfig.xml from 4.5 M6 IAC (flags: none)
  Patch for intermittent errors (flags: none)

Description Mark D Dunn CLA 2008-04-28 17:36:55 EDT
Build ID: 4.5 i6 

A consuming product consistently receives errors when launching a test on multiple agents. I am trying to launch a very simple test on 6-10 agents (Windows, AIX, and Linux platforms). I have not been able to launch the test successfully more than 6 times in a row before a failure occurs on one or more agents.


More information:
Although I don't have exact statistics on this, my launch success rate with the M5 driver was much better than with M6. At one point I was able to launch 10 out of 10 times with M5. I cannot tell at this time whether the problem is in the harness or in the agent.
Comment 1 Mark D Dunn CLA 2008-04-28 17:47:04 EDT
Most of the time (last week) I got InactiveProcess exceptions. Today, I got different errors - sometimes InactiveAgent exceptions (see below), and at other times a basic communications error message from the consuming product.

eclipse.buildId=unknown
java.fullversion=J2RE 1.6.0 IBM J9 2.4 Windows XP x86-32 jvmwi3260-20080415_18762 (JIT enabled, AOT enabled)
J9VM - 20080415_018762_lHdSMr
JIT  - r9_20080415_1520
GC   - 20080415_AA
BootLoader constants: OS=win32, ARCH=x86, WS=win32, NL=en_US
Framework arguments:  -product com.ibm.rational.rpt.product.ide
Command-line arguments:  -os win32 -ws win32 -arch x86 -product com.ibm.rational.rpt.product.ide


Info
Mon Apr 28 17:19:13 EDT 2008
RPTJ0064I An InactiveAgentException has occurred on Driver: mdunnxp.raleigh.ibm.com
org.eclipse.hyades.internal.execution.local.control.InactiveAgentException
	at org.eclipse.hyades.internal.execution.local.control.AgentImpl.invokeCustomCommand(AgentImpl.java:447)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.sendCommand(LoadTestCommandHandler.java:1116)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.sendTerminate(LoadTestCommandHandler.java:1008)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.run(LoadTestCommandHandler.java:215)
	at java.lang.Thread.run(Thread.java:735)


org.eclipse.hyades.internal.execution.local.control.InactiveAgentException
	at org.eclipse.hyades.internal.execution.local.control.AgentImpl.invokeCustomCommand(AgentImpl.java:447)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.sendCommand(LoadTestCommandHandler.java:1116)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.sendTerminate(LoadTestCommandHandler.java:1008)
	at com.ibm.rational.test.lt.execution.rac.LoadTestCommandHandler.run(LoadTestCommandHandler.java:215)
	at java.lang.Thread.run(Thread.java:735)

 
Comment 2 Mark D Dunn CLA 2008-04-28 17:48:40 EDT
I set the severity as major. This will probably end up as critical or blocker after more discussion.
Comment 3 Paul Slauenwhite CLA 2008-04-29 06:46:57 EDT
Reassigning to DuWayne since this defect is related to DuWayne's I6 enhancement.

DuWayne, please triage, assign a sizing, and provide a patch for I7.
Comment 4 Paul Slauenwhite CLA 2008-04-29 06:49:38 EDT
Increasing the severity since this is impacting a consuming product and is a regression from I5.
Comment 5 Mark D Dunn CLA 2008-04-29 16:46:01 EDT
There appear to be problems even with the M5 Agent Controllers, although M5 still works a little better than M6. Today I used only Windows Agent Controllers.

To determine whether the problem is in the workbench or in the Agent Controller, I ran an RPT 8 workbench against a known RAC - the RPT 7.0.2 RAC (which has TPTP 4.2.2). I was able to run 13 successful 10-agent runs in a row before I stopped.

I then ran the RPT 8 workbench against a TPTP 4.5 M5 RAC, and I had 8 successful runs before I got a failure. Yesterday I was not able to run more than 4 in a row using the M6 RAC.

Two of the failures are shown below:

1. 

eclipse.buildId=unknown
java.fullversion=J2RE 1.6.0 IBM J9 2.4 Windows XP x86-32 jvmwi3260-20080415_18762 (JIT enabled, AOT enabled)
J9VM - 20080415_018762_lHdSMr
JIT  - r9_20080415_1520
GC   - 20080415_AA
BootLoader constants: OS=win32, ARCH=x86, WS=win32, NL=en_US
Framework arguments:  -product com.ibm.rational.rpt.product.ide
Command-line arguments:  -os win32 -ws win32 -arch x86 -product com.ibm.rational.rpt.product.ide


Error
Tue Apr 29 15:50:35 EDT 2008
RPTA0004E A Test could not be launched on Driver: rptcore6.
The Test Execution Framework was not able to deliver an Executor.
This is an internal error, please contact support.



2.

Error
Tue Apr 29 15:50:35 EDT 2008
RPTA0011E An error has been encountered while launching a Test on Driver: rptcore6.
An Executor was not returned and neither was an error message.
This is an internal error, please contact support.
Comment 6 Bing Xu CLA 2008-05-02 11:44:26 EDT
Can someone attach the IAC's serviceconfig.xml file here? It's under the plugin org.eclipse.tptp.platform.ac.<platform>_<version>/agent_controller/config.
Comment 7 DuWayne Morris CLA 2008-05-05 08:58:29 EDT
Adding an initial sizing; this is truly a wild first guess, since we have no idea where the issues are.
Comment 8 DuWayne Morris CLA 2008-05-05 14:55:11 EDT
Created attachment 98687 [details]
serviceconfig.xml from 4.5 M6 IAC

Hi Bing,

Attaching the serviceconfig.xml from the IAC per your request.  This was from Mark Dunn's machine.
Comment 9 jkubasta CLA 2008-05-06 08:24:24 EDT
This is a must-fix. Please address this week.
Comment 10 Bing Xu CLA 2008-05-06 10:06:16 EDT
Hi DuWayne, I looked at your IAC config file. The only difference is that for the following two variables, you have multiple versions of the same jar installed. TPTP M6 uses V4.5.0; I think the V4.5.9 jars were installed by other Rational products. Given that 4.5.0 is still in front of 4.5.9 in the classpath, this shouldn't cause any problem, but just to be safe, can you rename the 4.5.9 jars (changing .jar to .txt will do it for the IAC)?



<Variable name="CLASSPATH_ORG_ECLIPSE_TPTP_PLATFORM_MODELS" position="append" value="D:\Program Files\IBM\IBMIMShared\plugins\org.eclipse.tptp.platform.models_4.5.0.v200803232151\tptp-models.jar;D:\Program Files\IBM\IBMIMShared\plugins\org.eclipse.tptp.platform.models_4.5.9.v200804180102\tptp-models.jar" /> 
  <Variable name="CLASSPATH_ORG_ECLIPSE_TPTP_PLATFORM_MODELS_HIERARCHY" position="append" value="D:\Program Files\IBM\IBMIMShared\plugins\org.eclipse.tptp.platform.models.hierarchy_4.5.0.v200804010100\tptp-models-hierarchy.jar;D:\Program Files\IBM\IBMIMShared\plugins\org.eclipse.tptp.platform.models.hierarchy_4.5.9.v200804180102\tptp-models-hierarchy.jar" /> 
Comment 11 jkubasta CLA 2008-05-06 13:30:17 EDT
DuWayne, please see Bing's comments and reply. Thanks.
Comment 12 Bing Xu CLA 2008-05-06 15:47:59 EDT
Stanislav, is this related to 201412? Basically, testing on the AC was fine for the first 300 tries and then it wouldn't start. It happens on Linux with the M6 release.
Comment 13 Stanislav Polevic CLA 2008-05-07 08:54:06 EDT
Well, that may be connected.
I'll check the AC for open handles and provide more details tomorrow.
Comment 14 Stanislav Polevic CLA 2008-05-08 08:11:19 EDT
I tested the AC, and it seems to leak 4 IPC handles on Linux per run (profiling an external app from the workbench).

output of lsof command:
> ACServer   6910 svpolevi  140r  FIFO        3,1             1627866 /tmp/IBMRAC/46ec5b61-6a17-450c-a2cd-a1443caf1ba6-stdout
> ACServer   6910 svpolevi  141r  FIFO        3,1             1627867 /tmp/IBMRAC/46ec5b61-6a17-450c-a2cd-a1443caf1ba6-stderr
> ACServer   6910 svpolevi  142w  FIFO        3,1             1627865 /tmp/IBMRAC/46ec5b61-6a17-450c-a2cd-a1443caf1ba6-stdin
> ACServer   6910 svpolevi  143w  FIFO        3,1             1627868 /tmp/IBMRAC/d69a026b-9392-4735-8948-7abd494bd4ec

I'm not sure this driver contains my fixes for bug 201412.
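
For reference, a minimal sketch (Linux-only, assuming /proc is mounted and run as the same user as ACServer; the class name is illustrative) of how the leak can be tracked by counting ACServer's open descriptors before and after each run:

    import java.io.File;

    public class FdCount {
        public static void main(String[] args) {
            if (args.length != 1) {
                System.err.println("usage: java FdCount <pid>");
                return;
            }
            // /proc/<pid>/fd holds one symlink per open descriptor
            String[] fds = new File("/proc/" + args[0] + "/fd").list();
            System.out.println(args[0] + ": "
                    + (fds == null ? 0 : fds.length) + " open descriptors");
        }
    }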
Comment 15 jkubasta CLA 2008-05-08 08:25:29 EDT
201412 was not fixed until M7/i7
Comment 16 Paul Slauenwhite CLA 2008-05-12 07:40:11 EDT
Moving over to the Platform Project since this defect appears to be an issue with the Agent Controller.
Comment 17 Paul Slauenwhite CLA 2008-05-12 10:28:03 EDT
Could defect #230426 be contributing to this behavior?
Comment 18 Bing Xu CLA 2008-05-12 10:46:10 EDT
Paul, I will suggest that Mark & DuWayne try it again once 230426 is fixed. From Stanislav's comments, they were affected by 201412 for sure.
Comment 19 Paul Slauenwhite CLA 2008-05-12 10:58:58 EDT
(In reply to comment #18)
> Paul, I will ask suggest Mark & Duwayne to try it again with once 230426 is
> fixed.  From Stanislav's comments, they were affected by 201412 for sure.  
> 

We are not planning on resolving #230426 in 4.5 unless it can be proven to be causing tangible symptoms.

DuWayne: Can you determine whether #230426 could be causing any tangible symptoms?
Comment 20 Jonathan West CLA 2008-05-12 12:32:19 EDT
Note: if it's occurring in M5, and it's occurring as soon as the agent controller is started (and not after hundreds of repeated runs), this implies the problem is unrelated to handle leaks.
Comment 21 Bing Xu CLA 2008-05-12 12:48:08 EDT
DuWayne, is this still occurring, or does it occur only with M5?
Comment 22 DuWayne Morris CLA 2008-05-12 13:05:07 EDT
In response to Bing: we are experiencing this problem with both M6 and i7. Our comparative testing was better with M5; however, I do not believe we tested enough with M5 to draw a conclusion, given the wide variability in results.

Marking this as a blocking defect. This defect is a blocker for a consuming product. We will review the comments above (15 to 19) and I will provide further notes later. There is a memo noting that Jonathan is seeing a failure in 1 of 3 URL test executions using the i7 driver.

Using the M6 AC, I have run many tests using 8 to 10 Windows-only agents in the consuming product. Failure rates have varied widely, from 1 in 3 to 5 runs to 1 in 15 or more executions (using the same agents and the same runtime workbench). Many of the failures actually happen at some point after the launch is completed, including after the agent reaches the "READY" state, or even later, such as communication errors ending in a hung state during log transfer after test execution is completed. We will work toward providing more details to help reproduce and isolate the problem.

Mark Dunn reported that failures have continued to present themselves with the i7 AC. 


The key point that everyone should keep in mind is that there are wide variations in failure rate with no change in configuration.  Do not draw conclusions from a string of 15 or 20 runs without failure.

As far as leaking handles go, I would note that the original 4.2.2 AC that we had been using in our product up until now was leaking handles at a very high rate and would ultimately limit us to around 200 test executions on Windows, due to the per-process limit on open handles. However, these leaking handles did NOT cause the lack of reliability we are currently experiencing. So, while leaking handles is not good, it is not necessarily related to the problem we are having.
Comment 23 Jonathan West CLA 2008-05-12 13:07:54 EDT
Thanks, DuWayne. From my own testing, there has been a drastic increase in intermittent errors between M5 and M7. On M5 I don't see any errors even after running 40+ URL tests; on M7, I get an error every 3-4 tests.
Comment 24 Bing Xu CLA 2008-05-12 14:00:18 EDT
(In reply to comment #17)
> Could defect #230426 be contributing to this behavior?

Paul, is 230426 new in M7, or has it existed since M5? What we found is that URL testing fails 25% of the time in M7, but M5/M6 have a better success rate.
Comment 25 Jonathan West CLA 2008-05-12 14:06:27 EDT
I am able to run 50+ URL tests without any errors on M6 as well. This suggests a change between M6 and M7.
Comment 26 Joe Toomey CLA 2008-05-12 14:20:53 EDT
I am fairly certain that 230426 is not a regression at all, but rather a bug that has been dormant (because you only notice the bug if you are relying on that method being called.)  Paul tried to use that method to do some cleanup and found that it wasn't being called reliably.

Paul -- please speak up if you disagree.
Comment 27 DuWayne Morris CLA 2008-05-12 14:25:52 EDT
Just as a single data point: I couldn't remember if it was 300 or 400 consecutive URL tests on a Linux agent using M6 before failure (I ran 100 at a time and lost count while doing other things). That failure was the one that left the AC in an unusable state that I gave to Bing to look at. The workbench was not able to communicate with the agent at that point.

Likewise, Todd performed a bunch of single-agent tests with generally high numbers of tests (~70 to ~200) before failure with the M6 and earlier ACs.

I have not tested with i7, but Jonathan and Mark have both reported that it is worse than M6. So it would appear that i7 is a significant regression from M6.

Comment 28 Paul Slauenwhite CLA 2008-05-12 14:56:53 EDT
(In reply to comment #26)
> I am fairly certain that 230426 is not a regression at all, but rather a bug
> that has been dormant (because you only notice the bug if you are relying on
> that method being called.)  Paul tried to use that method to do some cleanup
> and found that it wasn't being called reliably.
> 
> Paul -- please speak up if you disagree.
> 

Agree.  This defect is not a new regression.
Comment 29 Bing Xu CLA 2008-05-12 16:43:32 EDT
The problem is with M7's ProcessControlUtil.dll in <AC>/bin. The URL test worked fine for 20 runs after I replaced M7's ProcessControlUtil.dll with the one from M6. Jonathan is looking into this right now.
Comment 30 Jonathan West CLA 2008-05-13 00:21:44 EDT
The 10-20% failure rate we're seeing here seems to be a race condition, based on the behaviour I've seen thus far. In fact, in order to increase the incidence rate of the behaviour, I've had to use a program called Orbut (turbo, backwards) to simulate CPU usage. 

Paul or DuWayne, the failure follows these steps, which you may be able to shed some light on. As part of the URL test, two Java processes launch: the first is NodeImpl, and the second is HyadesJUnitRunner. The behaviour I'm seeing is: the first Java process (NodeImpl) starts up as expected and then launches the second Java process (HyadesJUnitRunner), which in the standard scenario runs through the URL test. However, in the 10-20% failure case, the second Java process starts but quickly dies. The first process then waits (presumably for an event from the now-nonexistent second process) and finally fails on a timeout.

As you're familiar with the test execution code, can you trace through and let me know why that second process is failing? Is it attempting to register with the AC, but failing? Is it not able to locate something in the classpath? Thanks!

Comment 31 DuWayne Morris CLA 2008-05-13 10:35:59 EDT
In the cases where we see InactiveProcessException, your comments describe exactly what we have observed on some of the failures. The test runner gets kicked off but immediately dies. It is true that we do not know why the runner process dies. I have looked for problems with deployment, to no avail thus far.

The InactiveProcessException is a 60-second timeout waiting for the test runner, which died immediately and never connected.
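
To illustrate that wait, a minimal sketch assuming a latch-style connect signal (the class and names are illustrative, not the actual harness API):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    public class RunnerWaitSketch {
        public static void main(String[] args) throws InterruptedException {
            CountDownLatch runnerConnected = new CountDownLatch(1);
            // In the real flow, the AC's C code launches the runner JVM here
            // (startProcess0 via JNI); if that child dies immediately, the
            // latch is never counted down and the wait below times out.
            if (!runnerConnected.await(60, TimeUnit.SECONDS)) {
                System.err.println("Runner never connected; this is the "
                        + "InactiveProcessException-style failure.");
            }
        }
    }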

Troubleshooting is more difficult than it could be, since CreateProcess for the test runner is being called by C code in the AC via a JNI call from the session JVM (startProcess0).  Thus, we don't get direct access to the failure.  

It could be an improper classpath when the JVM is kicked off in the C code, or something similar. Perhaps a command string mangled in the JNI code, a string passed in a buffer that is not always null-terminated at the correct location during JNI translation, or a C variable that is not properly initialized. There are many ways to break C and JNI code. The wide range in failure rate (1 in 3 to 5 runs versus 15 or so in a row with no failures) suggests this sort of problem.

Finally, I want to re-emphasize that this is only a portion of the failures that we see. On multi-agent launches, we randomly see failures at various points in the launch and even later. On account of this and other observations, I am not inclined to think this is a race condition.

What we are concerned about is that once we get past the regression in i7, we will revert to the more elusive random failures that exist in M6.

Comment 32 Joe Toomey CLA 2008-05-13 11:23:52 EDT
This is good progress, Jonathan.

I agree with DuWayne's concern that once we fix this regression, we'll likely be back to the M6 level of instability, but I don't see any other choice.  Let's fix them one at a time, as fast as we can.

To debug the problem of the runner dying, I suggest a first step of determining whether the command line is valid. There are several ways to do this, but one of them is to use Process Explorer to "catch" the process before it dies, and then inspect the command line of the process. In this case, it should be the same command line that is used when the process launches successfully. (Sometimes it helps to slow down Process Explorer's refresh rate to allow you to catch the process before it exits.) If the command line is munged, then we can explore how that happened within the AC.

If the command line is not munged, then it's possible that the runner died because of some unexpected interaction with the agent controller. Is there anything in the AC log file that indicates that it received communication from the runner that subsequently died? (If there is, it would render the above debugging unnecessary.) If we need to, we could instrument the runner to write to a file to do some primitive tracing. This is likely a case where more involved diagnostic techniques would result in a Schrödinger's cat measurement problem (i.e., by trying to observe the system, we end up changing it). Hopefully logging to a simple file would be unobtrusive enough not to mask the failure.
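
As a minimal sketch of that primitive tracing (the class, file path, and checkpoint names are all illustrative, not part of the actual runner):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public final class Trace {
        private Trace() {}

        // Append a timestamped checkpoint to a plain file; kept deliberately
        // simple so the observation itself does not perturb the race, and
        // swallowing IOException so tracing can never kill the runner.
        public static synchronized void point(String what) {
            PrintWriter w = null;
            try {
                w = new PrintWriter(
                        new FileWriter("C:\\temp\\runner-trace.log", true));
                w.println(System.currentTimeMillis() + " " + what);
            } catch (IOException ignored) {
            } finally {
                if (w != null) w.close();
            }
        }
    }

Calls such as Trace.point("runner main entered") at startup, before AC registration, and just before exit would show how far the dying runner gets.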

Please try to determine (either by using Process Explorer to inspect the command line or by locating evidence of communication in the AC log file) whether the process JVM actually starts successfully when the failure occurs. If it does not, we need to find out how the parameters passed to the session have become corrupted. If it does, please let me know and I'll make an instrumented runner jar to collect more information.

Thanks!
Comment 33 Jonathan West CLA 2008-05-13 11:51:00 EDT
>> Is there anything in the AC log file that indicates that it received communication from the runner that subsequently died? 

The only messages we're seeing in the agent controller log are 'Agent not found RemoteSession' and 'A processExitedEvent was received for which we have no corresponding agent'. The first is normal for an agent for which we have received no configuration information (which is fine because it's an optional field), and the second indicates that one of the agents has sent us a process-exited event that we did not expect (which may or may not be innocuous).

As to whether or not the problem is a race condition: the intermittent nature of the problem suggests a race condition to me. If it were a logic problem without time dependence, presumably we would see it all the time. But it would be great if it weren't one. ;)


Comment 34 Jonathan West CLA 2008-05-13 12:37:01 EDT
Also note that these same AC messages appear consistently in the non-failing scenario as well: a single set of them for every test that is run.
Comment 35 Jonathan West CLA 2008-05-13 16:10:48 EDT
Just to complete the thought from my previous message: as per the above, we aren't seeing any out-of-the-ordinary agent controller messages. Nor do I see any communication problems in the agent controller log files related to any of the launched JVM agents.

As for the rest of the information you're looking for, Joe: I'm not familiar with the test framework or its agents. Paul/DuWayne, can you give Joe the information he is requesting?
Comment 36 Bing Xu CLA 2008-05-13 16:30:28 EDT
DuWayne, can you or Mark try running the same URL test using the M7 build and the M6 stand-alone AC? This way we can avoid the problem with M7's ProcessControlUtil.dll and include the fix for 201412.

I tried this combo and ran the URL test 400 times; no errors were reported.
Comment 37 DuWayne Morris CLA 2008-05-14 14:27:15 EDT
The Linux AC continues to demonstrate serious issues, including hogging the CPU, spontaneous termination of the ACServer process, etc.

Per the discussion with Bing, I ran stress tests on my Linux Red Hat 5 machine, using a TPTP 4.5 i8 TP1 candidate driver against a 4.5 M6 AC. I ran 188 tests; there was a gradual slowdown in executions, to the point of around 5 minutes per simple test execution, or it was genuinely hung. I restarted the workbench. There were two zombie Java processes that I killed.

I restarted the workbench, left the AC running, and experienced the serious slowdown again; I closed the workbench on execution 143. A single Java process was running, which I killed. I then noticed that the ACServer process was continuously consuming 20 to 50% of the CPU while not running tests.

After restarting the workbench, I immediately got extremely slow test executions - around 12 executions in 45 minutes. I decided to restart both the workbench and the AC. I closed the workbench and ran ./RAStop.sh, only to find that the AC was not running. Thus, at some point, the AC process had gone away and I had evidently been running against the IAC.

I set the preferences to disable the IAC and started over. Tests then executed at normal speed up to the 15th test while I was writing up some of these results. Then, again, I got a long hang on the 16th test execution. It is a simple test against Google, and the local browser on the test machine connects quickly with no problem at the same time as the hang. The ACServer process is not consuming significant CPU at this point. I did get a couple of RMI exceptions; one of them may have been due to closing the workbench during a hang, and the other might have been a communications failure with the AC.

The AC log is still available for examination and I can provide access to the machine.  

I restarted the workbench and started running tests again without restarting the AC. This time, I created a new recording to run against a local server, in case some of the behavior was associated with Internet network issues. After about 300 executions, there was a string of RMI exceptions, followed by a Daemon Connect Exception for each execution attempt.

Again, the AC was using around 30 to 50% CPU with no tests running. I restarted the workbench. There was a hang with no launch, followed by a "the Host local host was not found" error. Again, the ACServer process was no longer running.

If there is anything else we need to do to narrow down the issue on Linux, please let us know. (I do also need the machine to finish i8 TP1.)
Comment 38 DuWayne Morris CLA 2008-05-14 14:38:47 EDT
There was a question yesterday about the number of open handles allowed, since some of the AC issues have been with leaking handles.

On Windows, the maximum number of open handles per process is 10,000, as discussed in the following Microsoft Knowledge Base article:

http://support.microsoft.com/kb/327699

On Linux Red Hat, the default limit is 1024 open file descriptors per process, which can be tuned to a higher limit if needed.   The command ulimit -a displays the various limits.
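
For checking the descriptor limits from inside a JVM, a minimal sketch (this relies on com.sun.management.UnixOperatingSystemMXBean, which is Sun-specific and may not exist on all VMs, e.g. IBM J9; it also reports only on the JVM's own process, not ACServer):

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdLimits {
        public static void main(String[] args) {
            Object os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                // Current open descriptors vs. the per-process cap (the
                // soft limit that `ulimit -n` reports, 1024 by default).
                System.out.println("open: " + unix.getOpenFileDescriptorCount()
                        + " / max: " + unix.getMaxFileDescriptorCount());
            }
        }
    }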

Comment 39 Bing Xu CLA 2008-05-14 14:45:19 EDT
(In reply to comment #37)

> Per the discussion with Bing, I ran stress tests on my Linux RedHat 5 machine,
> using TPTP 4.5i8 TP1 candidate driver to a 4.5 M6 AC.  I ran 188 tests...

How is the M8 with M6 AC combo on Windows?
Comment 40 Jonathan West CLA 2008-05-14 15:42:57 EDT
Hi DuWayne... on the agent controller you used on the Linux machine, the servicelog.log file seems to have reached an OS maximum (either memory or file size): it is EXACTLY 2^31 - 1 bytes in size (2.14 GB).

Actually, come to think of it, the file I/O keeps track of its place in the file and writes to that position (e.g., with an open file handle). It looks like it hit the maximum range of a signed 32-bit integer and overflowed, crashing the AC.
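
A minimal sketch of that suspected failure mode - tracking the write position in a signed 32-bit integer, which wraps negative once the file reaches 2^31 - 1 bytes (the class name is illustrative):

    public class OffsetOverflow {
        public static void main(String[] args) {
            int pos = Integer.MAX_VALUE; // 2147483647 = 2^31 - 1, the observed file size
            pos += 1;                    // wraps to -2147483648
            // Any subsequent seek/write to a negative offset fails,
            // which would crash or wedge the AC as described above.
            System.out.println("next write offset: " + pos);
        }
    }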

My recommendation: when running a large amount of data through the agent controller, turn off the debug logging, to keep from overflowing the AC's log. This would also explain the slowdowns you were seeing.

Can you see if you still have the same types of problems with the debug option off on Linux?

Comment 41 Jonathan West CLA 2008-05-14 15:47:29 EDT
If in fact this log file was generated without debug (I can't tell based on the serviceconfig.xml), can you look through it and see if we're in some kind of loop? 2.14 GB is a lot of data to push from Raleigh to Toronto. ;)
Comment 42 Jonathan West CLA 2008-05-14 16:02:35 EDT
Bing and I have looked at the log file a bit more, and it looks like it contains only INFORMATION or higher entries. Can you try running with a higher log level, like ERROR, to see what the result is? I'd hate to think that the default logging level was verbose enough to affect your use case....
Comment 43 DuWayne Morris CLA 2008-05-14 16:24:24 EDT
Interesting - I just installed the AC fresh this morning and took the defaults. I did not realize this would result in such a huge file and had not noticed it. It was in a loop only in the sense that I was using Joe's stress-test plugin and running the number of tests that I indicated in my write-up.

Maybe there was an infinite loop inside the AC code, which would explain the high CPU usage when no test was running. If that is the case, we will get this problem at any log level when there is a failure. Did you see anything in the log timestamps to indicate a tight loop? I didn't do anything that should have caused such a condition. Given that possibility, any further thoughts? What log level do you want me to set?

Comment 44 Jonathan West CLA 2008-05-15 13:42:26 EDT
Created attachment 100498 [details]
Patch for intermittent errors

This patch fixes the intermittent test failure problems we've observed.
Comment 45 Jonathan West CLA 2008-05-15 13:44:28 EDT
Stanislav, can you review the patch?
Comment 46 Joe Toomey CLA 2008-05-15 15:47:27 EDT
Thanks Jonathan.

I have confirmed that your patch improves reliability significantly. Using your patch, I executed ~350 JUnit tests before a test run hung. (The runner and session JVMs were both gone, but the workbench was still waiting.) There was nothing of interest in the AC log (which I left at the INFORMATION level because of https://bugs.eclipse.org/bugs/show_bug.cgi?id=232394). After I recycled the workbench (but not the AC), I was able to continue running tests.

We'll continue trying to debug the remaining stability issues in launching tests from the consuming product (which our quick tests show to still be far worse than 4.2.2, and which will still block us from shipping). But it's good to get past the first hurdle. Thanks.

Comment 47 Bing Xu CLA 2008-05-15 16:32:10 EDT
(In reply to comment #46)

> your patch, I executed ~350 JUnit tests before a test run hung.  (The runner
> and session JVMs were both gone, but the workbench was still waiting.)  

Joe, I assume this was on Windows, since Jonathan only provided the patch for Windows. It's good that the runner and session JVMs were both gone; any idea what the workbench is waiting for?

> Nothing
> of interest in the AC log (which I left at INFORMATION level because of
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=232394 ).  

How big is the log after 350 runs? Is it even close to the 2 GB we got yesterday?
Comment 48 Joe Toomey CLA 2008-05-15 22:44:34 EDT
(In reply to comment #47)
> (In reply to comment #46)
> 
> > your patch, I executed ~350 JUnit tests before a test run hung.  (The runner
> > and session JVMs were both gone, but the workbench was still waiting.)  
> 
> Joe, I assume this was on Windows since Jonathan only provided the patch for
> Windows.  It's good that the runner and session JVMs were both gone,  any idea
> what the workbench is waiting for?

Correct -- all Windows only. The workbench was waiting for the test to complete, which it apparently did not do. The runner appeared to have died of unnatural causes.

> > Nothing
> > of interest in the AC log (which I left at INFORMATION level because of
> > https://bugs.eclipse.org/bugs/show_bug.cgi?id=232394 ).  
> 
> How big is the log after 350 run?  Is it even close to the 2G we got yesterday?
> 

Good question.  The log was quite small -- a bit less than 8 MB.
Comment 49 Stanislav Polevic CLA 2008-05-16 04:01:10 EDT
Cross-boundary allocation again. Sigh...

Jonathan, your patch is good.
Comment 50 jkubasta CLA 2008-05-16 17:26:37 EDT
5/15/2008 patch applied to Head with PMC approval
Comment 51 Steve Francisco CLA 2008-05-30 10:26:22 EDT
Is this fixed? (Status still shows NEW.)
Comment 52 jkubasta CLA 2008-05-30 10:34:53 EDT
Still investigating on Linux
Comment 53 jkubasta CLA 2008-06-11 16:40:02 EDT
No further changes are required.
Comment 54 Paul Slauenwhite CLA 2009-06-30 12:12:33 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.