Bug 166213 - The agent controller is unreliable when used to profile external java applications
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: TPTP
Version: unspecified
Hardware: PC Windows XP
Importance: P3 blocker
Target Milestone: ---
Assignee: Andy Kaylor CLA
QA Contact:
URL:
Whiteboard: closed460
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-29 10:42 EST by amehrega CLA
Modified: 2016-05-05 10:53 EDT
CC List: 7 users

See Also:


Attachments
Test classes (1.27 KB, application/octet-stream)
2006-11-29 10:43 EST, amehrega CLA
The Agent Controller log file (59.02 KB, text/plain)
2006-11-29 16:32 EST, amehrega CLA
Proposed fix for this problem (9.56 KB, patch)
2006-11-30 20:18 EST, Andy Kaylor CLA

Description amehrega CLA 2006-11-29 10:42:10 EST
Build:  TPTP-4.2.1.1-200611271752

Follow the steps *exactly* as outlined below:
Make sure that a standalone installation of the Agent Controller is configured and started.

1) Extract the zip file attached to a directory.  It should contain three files: HelloWorld.class, Classpath.class, and ForClasspath.class.
2) Open the profile launch configuration and create a launch configuration of type "External Java Application".
3) Switch to the Main tab and browse to the "Classpath.class" file
4) Switch to the Monitor tab and de-select everything but "Execution Time Analysis"
5) Click apply
6) Repeat steps 2-5, but select "HelloWorld.class" in step 3
7) Keep alternating between launching Classpath and HelloWorld.

The launch progress will eventually get stuck at 78%, and the Agent Controller will not respond until it is restarted.  This is blocking me from running the automated launch test suites.  I can consistently reproduce this problem on my machine.
Comment 1 amehrega CLA 2006-11-29 10:43:16 EST
Created attachment 54719
Test classes
Comment 2 amehrega CLA 2006-11-29 10:48:20 EST
I'm using IBM JRE 1.5:

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pwi32dev-20060511 (SR2))

IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Windows XP x86-32 j9vmwi3223-20060504 (JIT enabled)
J9VM - 20060501_06428_lHdSMR
JIT  - 20060428_1800_r8
GC   - 20060501_AA)
JCL  - 20060511a
Comment 3 Karla Callaghan CLA 2006-11-29 12:43:33 EST
Andy - please take a look immediately so that an assessment can be made for 4.2.1.1.
Comment 4 Andy Kaylor CLA 2006-11-29 15:37:21 EST
I am unable to reproduce this problem.  I ran about 50 iterations of these tests without any problems.  I tried both with a 4.3 TPTP client and a 4.2.1.1 client.

My JVM is slightly different than yours.  My JVM is:

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pwi32devifx-20060124)
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Windows XP x86-32 j9vmwi3223ifx-20060124 (JIT enabled)
J9VM - 20051027_03723_lHdSMR
JIT  - 20051027_1437_r8
GC   - 20051020_AA)
JCL  - 20060120

I believe the JVM you are using is one that Paul Slauenwhite was using for a problem he reported last week.  I wasn't able to reproduce that problem either (nor was I able to find a place to download that JVM).

Can you try this with a different JVM?

Kevin says the automated test suites worked for him with 4.2.1.1 on an Itanium-based system.
Comment 5 amehrega CLA 2006-11-29 16:31:36 EST
I suspect that this is a concurrency problem that can't consistently be reproduced.  Nevertheless, I can easily reproduce this problem when I alternate between launching the two applications.

Navid too has hit similar problems.  I will try this with a different JVM and let you know the results shortly.
Comment 6 amehrega CLA 2006-11-29 16:32:26 EST
Created attachment 54747
The Agent Controller log file

Here's the agent controller log file in case it's needed.
Comment 7 amehrega CLA 2006-11-29 16:35:01 EST
I can reproduce this with Sun JRE:

java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode)

Btw, I have a hyperthreaded machine.
Comment 8 amehrega CLA 2006-11-29 16:37:08 EST
I can reproduce this after the third launch.  Make sure that you perform the process launches quickly using the drop down menu of the profile toolbar item.

Here's the order: HelloWorld, Classpath, HelloWorld (this process hangs).
Comment 9 Navid Mehregani CLA 2006-11-29 16:55:29 EST
I've come across this problem before.

Note that someone has also mentioned this problem recently in the newsgroup.  See the following entry posted on Nov 27th by Michael Sachs:

Profiler - Launching stops at 78%
Comment 10 Andy Kaylor CLA 2006-11-29 18:27:11 EST
OK, I was waiting for one agent to terminate before starting the next.  When I try it fast, I can reproduce the problem.

The thing that happens is that an exception occurs inside ipcStopFlushing.  We have an exception handler that catches the error and keeps this from manifesting itself as a crash, but there's a thread waiting for a response to the StopAgentDataFlush message.

This is very probably caused by the first set of fixes we put into the 4.2.1.1 branch for memory leaks.  I don't think it will be easy to fix.  I don't really know why the exception is occurring in the ipcStopFlushing routine.
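
A minimal sketch of the failure mode described in this comment, written in Python rather than the Agent Controller's native code. The names handle_command and send_and_wait are hypothetical, and the RuntimeError only stands in for whatever goes wrong inside ipcStopFlushing. It illustrates why swallowing the exception trades a crash for a hang: the response is never sent, so only a timeout lets the waiting thread give up.

```python
import threading


def handle_command(cmd, response_ready):
    """Hypothetical command handler whose broad handler swallows failures."""
    try:
        if cmd == "stopDataFlush":
            # Stand-in for the exception raised inside ipcStopFlushing.
            raise RuntimeError("failure while stopping the data flush")
        # Normal path: signal the thread that is waiting for a response.
        response_ready.set()
    except Exception:
        # The error is swallowed: no crash, but also no response.
        pass


def send_and_wait(cmd, timeout):
    """Send a command and wait for its response; True if one arrived."""
    response_ready = threading.Event()
    worker = threading.Thread(target=handle_command, args=(cmd, response_ready))
    worker.start()
    worker.join()
    # With no timeout, this wait would block forever after a swallowed error.
    return response_ready.wait(timeout)
```

Calling send_and_wait("stopDataFlush", 0.1) returns False in this sketch, while any other command returns True.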
Comment 11 Andy Kaylor CLA 2006-11-29 18:31:05 EST
One more bit of information.  I still can't reproduce this on my laptop (which is the only place I could run the IBM JVM).  The machine I reproduced it on has a Pentium 4 with hyperthreading technology.  So the way hyperthreading changes the timing may be the key factor.
Comment 12 Karla Callaghan CLA 2006-11-29 19:15:20 EST
Ali - From what Andy has learned so far, it looks like the same issue would exist in 4.3.  What is the difference in your 4.2.1.1 testing and 4.3 testing?  (Was the hyperthreaded system used only in 4.2.1.1 testing?)
Comment 13 Andy Kaylor CLA 2006-11-29 19:47:36 EST
Further testing seems to indicate that this doesn't actually depend upon the speed with which you invoke the profiling runs.  I can reproduce this on my hyperthreading box even if I pause between runs.

To verify that you are seeing the same problem that I am, look in your "bin" directory immediately after a crash and see if it contains a file called "tptpParseError.log".  I would expect that it does and that this file would contain a line that looks something like this:

11/29/06 16:08:24 "C:\Eclipse\Builds\AC4211_1127\bin\ACServer.exe") Unexpected exception occurred while parsing Cmd in message "<Cmd src="100" dest="127" ctxt="1281"><stopDataFlush iid="org.eclipse.tptp.legacy"></stopDataFlush></Cmd>"

As it turns out, the exception handler for our parser catches a lot more than the parsing errors it was intended to catch, and it records what it has caught in this relatively obscure file.
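
To check for this signature without opening the file by hand, something like the following Python sketch could scan the log. The regular expression is modeled only on the single sample line quoted above, so real entries may need a looser pattern. The function names are made up for illustration.

```python
import re
from pathlib import Path

# Pattern modeled on the one sample entry in this bug; real logs may differ.
ENTRY = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'"(?P<exe>[^"]+)"\) Unexpected exception occurred while parsing Cmd '
    r'in message "(?P<message>.*)"$'
)


def find_swallowed_exceptions(log_text):
    """Return one dict per log line that matches the sample entry format."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def scan_log(bin_dir):
    """Scan tptpParseError.log in bin_dir; an absent file yields no entries."""
    log_path = Path(bin_dir) / "tptpParseError.log"
    if not log_path.is_file():
        return []
    return find_swallowed_exceptions(log_path.read_text(errors="replace"))
```

An entry whose message contains stopDataFlush would suggest the same problem as the one discussed here.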

It turns out that the problem manifests itself this way because we are waiting for a response to the stopDataFlush command.  I don't think we need to wait, but we are waiting.  If I take out the line that waits, then the exception handler masks the problem (which is not to say this fixes the problem).  I've made this change in my sandbox and verified that it hides this problem (though I haven't done any testing to verify that it doesn't introduce something else).
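
For contrast with masking the problem, here is a rough Python sketch of a dispatcher that always posts exactly one reply, so a thread waiting on a command cannot be stranded by a swallowed exception. This only illustrates the general idea. It is not the Agent Controller's code and not the sandbox change described in this comment; dispatch, stop_flushing, and the reply tuples are all hypothetical.

```python
import queue
import xml.etree.ElementTree as ET


def stop_flushing():
    """Hypothetical stand-in for the routine that fails in this bug."""
    raise RuntimeError("failure while stopping the data flush")


def dispatch(raw_cmd, replies):
    """Parse and run one command, always posting exactly one reply."""
    try:
        root = ET.fromstring(raw_cmd)
    except ET.ParseError as exc:
        # Only genuine parse failures are reported as parse errors.
        replies.put(("parse-error", str(exc)))
        return
    try:
        if root.find("stopDataFlush") is not None:
            stop_flushing()
        replies.put(("ok", ""))
    except Exception as exc:
        # Runtime failures get their own reply instead of being swallowed,
        # so a thread blocked on the reply queue is always released.
        replies.put(("command-failed", str(exc)))
```

A caller blocked in replies.get() then sees a "command-failed" reply rather than waiting indefinitely, and the parse-error path no longer hides unrelated failures.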

So, while it is not anything I would be proud of, I think we can make this problem "go away", which might be OK so long as we remember this and actually fix it in an upcoming release.

The other troubling thing about this is that it seems like it should happen in 4.3 as well.  I don't see any reason it wouldn't.
Comment 14 Andy Kaylor CLA 2006-11-29 20:09:16 EST
It does happen with 4.3 on my system.  It also doesn't seem to depend on alternating between the two classes.  I was able to reproduce it just by repeatedly (not quickly) launching the Classpath.class run.
Comment 15 amehrega CLA 2006-11-29 22:38:34 EST
I was using my laptop during 4.3 testing to launch the test suites that detected this problem.  I had encountered this problem manually on my desktop before but I never had any consistent way of reproducing it.  I was hoping that the reliability patches that went in during 4.3 had solved this problem but this doesn't seem to be the case.  What puzzles me is why the Agent Controller just completely stops responding after the error.  It always needs to be restarted after the error is encountered.
Comment 16 Navid Mehregani CLA 2006-11-30 14:28:08 EST
I hit this problem on _first_ try with the TPTP-4.2.1.1-200611271752A driver installed on my desktop!  I'm using an IBM JVM 1.5.
Comment 17 Navid Mehregani CLA 2006-11-30 14:35:15 EST
This is why we need to stress test the Agent Controller.  I believe most of these problems can be flushed out with proper stress test cases.  See bug#160940
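
As a rough idea of what such a stress case could look like, the Python sketch below alternates between launch callables and uses a watchdog timeout to detect a hang. The launch callables are placeholders: how a real test would drive an External Java Application profile launch is not shown in this bug, so run_with_watchdog and stress are purely illustrative names.

```python
import itertools
import threading


def run_with_watchdog(launch, timeout):
    """Run launch() in a daemon thread; return True if it finished in time."""
    worker = threading.Thread(target=launch, daemon=True)
    worker.start()
    worker.join(timeout)
    return not worker.is_alive()


def stress(launchers, iterations, timeout):
    """Alternate between launchers; return the index of the first hang, or None."""
    cycle = itertools.cycle(launchers)
    for i in range(iterations):
        if not run_with_watchdog(next(cycle), timeout):
            return i
    return None
```

Passing two launchers mirrors the alternating HelloWorld and Classpath runs from the description; a non-None result is the zero-based iteration at which a launch stopped responding.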
Comment 18 Andy Kaylor CLA 2006-11-30 20:18:29 EST
Created attachment 54849
Proposed fix for this problem

This attachment contains the source changes for the fix to this problem.  This patch is based on the 4.2.1.1 code branch, but I think it can also be applied to the 4.3 based code.

Since the change may not be approved for either the 4.3 or 4.2.1.1 releases, I'm putting the code here to be accessed when we are ready to put it in a future release.
Comment 19 Harm Sluiman CLA 2006-12-01 16:00:20 EST
This is really unfortunate.

In reality both 4.3 and 4.2.1.1 are already closed up.
The memory improvements were a big step forward and we have to accept that as the line for those releases.

It seems we need to get this into 4.2.2 and its sister 4.3 stream ASAP, and for sure into HEAD so it can be thoroughly tested.
Comment 20 Marius Slavescu CLA 2006-12-02 01:35:33 EST
I found that this scenario and others work fine with the AC from TPTP 4.2.1 when using both TPTP 4.3 and TPTP 4.2.1.1 clients.

I don't know what was fixed in TPTP 4.2.1.1 regarding the AC, but we could probably mention in the Readme that users who encounter one of the problems described in this bug should try the AC from TPTP 4.2.1.
Comment 21 Andy Kaylor CLA 2006-12-04 15:11:52 EST
I've checked a fix for this into HEAD.  It will be available in 4.4 builds.

Also, Marius is right.  This bug is not present in 4.2.1 or earlier releases.  The fixes in 4.2.1.1 and 4.3 are mostly stability and memory leak fixes.  There are significant improvements in those areas, but if someone is running into this problem a lot, the older releases might be useful.
Comment 22 Joe Toomey CLA 2006-12-18 18:15:47 EST
*** Bug 167747 has been marked as a duplicate of this bug. ***
Comment 23 Andy Kaylor CLA 2007-01-18 15:05:42 EST
This fix is now in the 4.2.2 code stream as well.
Comment 24 Paul Slauenwhite CLA 2009-06-30 12:04:25 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.