Bug 168138

Summary: Test launch race condition often delays test launch by 15 seconds
Product: z_Archived
Reporter: Joe Toomey <jptoomey>
Component: TPTP
Assignee: Jonathan West <jgwest>
Status: CLOSED WONTFIX
QA Contact:
Severity: major
Priority: P2
CC: analexee, guru.nagarajan, jgwest, kathy, kdsiefke, paulslau, toddmm
Version: unspecified
Keywords: plan
Target Milestone: ---
Hardware: All
OS: All
Whiteboard: housecleaned460 closed460
Attachments:
  Patch file for HEAD with fixes (flags: none)

Description Joe Toomey CLA 2006-12-14 17:50:15 EST
When migrating to the new Agent Controller, a noticeable slowdown in test launch speed
was observed by many people, and was attributed to the compatibility layer on top of the
new AC.  More recently, as tests have been written to measure the stability, memory usage
and performance of test execution, I observed that test launches generally take around
20 seconds, but on occasion take only 5 seconds.  In runs of 200 tests, there was a clear
pattern where roughly 30% of the tests ran in 5 seconds, while the rest ran in roughly
20 seconds.

I have investigated the cause of this, and it turns out to be a race condition
in communication between the workbench and the runner's agent.  This race condition was
present with the old agent controller as well, but it appears that the performance characteristics
of certain commands with the new agent controller bias us much more heavily towards losing
the race and paying the performance penalty.

There is a 2-way handshake that takes
place when a test is launched.  The workbench launches the test via the AC and then waits
for the test process to become active.  There is no notification available for this, so
the workbench polls for the process.  On the other side, when the test process comes up,
the runner agent registers itself with the AC and then blocks, waiting to be signaled
by the workbench that it is ready for the test to start.
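
The handshake described above can be sketched as a small simulation. This is only an illustration of the polling and blocking pattern, not TPTP code: the class name, the AtomicBoolean standing in for the AC's view of the agent, and the CountDownLatch standing in for the workbench's signal are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class HandshakeSketch {

    /**
     * Simulates one launch. Returns true if the (hypothetical) runner agent
     * received the start signal from the workbench.
     */
    public static boolean launch(long processStartupMs, long pollIntervalMs) {
        AtomicBoolean registered = new AtomicBoolean(false); // stand-in for the AC's view of the agent
        CountDownLatch okToStart = new CountDownLatch(1);    // stand-in for the workbench's signal
        AtomicBoolean started = new AtomicBoolean(false);

        Thread runner = new Thread(() -> {
            try {
                Thread.sleep(processStartupMs);  // test process coming up
                registered.set(true);            // runner agent registers with the AC
                // ...then blocks until the workbench says the test may start.
                started.set(okToStart.await(5, TimeUnit.SECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        runner.start();

        try {
            // Workbench side: no notification is available, so poll.
            while (!registered.get()) {
                Thread.sleep(pollIntervalMs);
            }
            okToStart.countDown();               // signal the runner
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return started.get();
    }

    public static void main(String[] args) {
        System.out.println("runner started: " + launch(50, 10));
    }
}
```

Note that in this sketch the poll is a cheap flag check. The problem in this bug comes from the real poll being much heavier than that.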

The race condition occurs
as a side effect of the NodeImpl.listProcesses() method, which, despite its innocuous name,
does a lot more than list the processes from the node.  After querying the AC (sending
a command to list the processes from the node), it then queries again to list every agent
within each process, and yet again to get the details of every agent.  It is the last
of these queries that causes the race condition.  The query to list the agent's details
is sent as a command to the agent itself.  If the (runner) agent has already blocked waiting
to hear from the workbench that it is okay to start, then the command will not receive
a response because the agent is blocked.  This results in a (15 second) timeout waiting
for the details of the runner agent to be returned.
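
To make the timing concrete, here is a rough simulation of the details query under both outcomes of the race. Everything in it is hypothetical (the class, the method, and the reply string), and the timeout is passed in so that a short value can stand in for the 15 seconds above. The agent thread either answers immediately or is already blocked waiting for the workbench, in which case the query can only end by timing out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DetailsQueryRace {

    /**
     * Sends one simulated "get details" command to the runner agent and
     * reports whether it timed out. agentAlreadyBlocked models which side
     * won the race.
     */
    public static boolean queryTimesOut(boolean agentAlreadyBlocked, long timeoutMs) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        CountDownLatch okToStart = new CountDownLatch(1);

        Thread agent = new Thread(() -> {
            try {
                if (agentAlreadyBlocked) {
                    okToStart.await();       // blocked: cannot service commands yet
                }
                reply.complete("agent details");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        agent.setDaemon(true);
        agent.start();

        boolean timedOut = false;
        try {
            reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            timedOut = true;                 // the launch itself still proceeds
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        okToStart.countDown();               // workbench finally signals the runner
        return timedOut;
    }

    public static void main(String[] args) {
        // 150 ms stands in for the 15 second timeout in the report.
        System.out.println("agent still servicing commands, timed out: " + queryTimesOut(false, 150));
        System.out.println("agent already blocked, timed out: " + queryTimesOut(true, 150));
    }
}
```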

Interestingly, this does not cause
a failure in test launch, because we didn't need those details in the first place.
Comment 1 Joe Toomey CLA 2006-12-14 17:53:36 EST
My fix for this is to implement a more efficient listProcesses() method that does not query the properties of every agent.  This will not only remove the race condition, but also eliminate some unnecessary agent controller chatter during the test launch sequence.

In preliminary testing with this fix, tests generally launch in 4 seconds.
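
The idea can be sketched roughly as follows. This is not the attached patch and not the real TPTP API: the Controller interface, its method names, and the CountingController fake are hypothetical, invented only to show the difference between a listing that sends a command to every agent and one that does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LightweightListing {

    /** Hypothetical view of the Agent Controller; not the real TPTP API. */
    public interface Controller {
        List<Long> listProcessIds();
        List<String> listAgentNames(long pid);
        String queryAgentDetails(long pid, String agentName); // may stall on a blocked agent
    }

    /** Fake controller that counts how many commands reach the agents. */
    public static class CountingController implements Controller {
        public int detailQueries = 0;

        public List<Long> listProcessIds() {
            return Arrays.asList(100L, 200L);
        }

        public List<String> listAgentNames(long pid) {
            return Collections.singletonList("runner-" + pid);
        }

        public String queryAgentDetails(long pid, String agentName) {
            detailQueries++;
            return "details of " + agentName;
        }
    }

    /** Old shape: processes, then agents, then details of every agent. */
    public static Map<Long, List<String>> listProcessesWithDetails(Controller ac) {
        Map<Long, List<String>> result = new LinkedHashMap<>();
        for (long pid : ac.listProcessIds()) {
            List<String> details = new ArrayList<>();
            for (String agent : ac.listAgentNames(pid)) {
                details.add(ac.queryAgentDetails(pid, agent));
            }
            result.put(pid, details);
        }
        return result;
    }

    /** Lighter shape: same listing, but never sends a command to an agent. */
    public static Map<Long, List<String>> listProcessesLight(Controller ac) {
        Map<Long, List<String>> result = new LinkedHashMap<>();
        for (long pid : ac.listProcessIds()) {
            result.put(pid, ac.listAgentNames(pid));
        }
        return result;
    }

    public static void main(String[] args) {
        CountingController full = new CountingController();
        listProcessesWithDetails(full);
        CountingController light = new CountingController();
        listProcessesLight(light);
        System.out.println("detail queries (with details): " + full.detailQueries);
        System.out.println("detail queries (light): " + light.detailQueries);
    }
}
```

Because the lighter listing never waits on an agent, a blocked runner cannot stall it. The trade-off is that any caller that actually needs agent properties would have to request them separately.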
Comment 2 Joe Toomey CLA 2006-12-14 18:10:10 EST
Created attachment 55727 [details]
Patch file for HEAD with fixes

Use this to apply to 4.2.2 if required.
Comment 3 Joe Toomey CLA 2006-12-14 18:14:08 EST
Delivered to HEAD.
Comment 4 Joe Toomey CLA 2007-08-24 15:25:05 EDT
Rolled this fix back due to:

198964: Regression prevents consuming product test launch
https://bugs.eclipse.org/bugs/show_bug.cgi?id=198964

Will investigate a different way to avoid this race in 4.5
Comment 5 jkubasta CLA 2008-02-12 13:33:56 EST
Igor, would you please take a look at this defect and see if you can resolve?
Comment 6 Igor Alelekov CLA 2008-03-17 05:29:12 EDT
Reassigning to Stas for investigation
Comment 7 jkubasta CLA 2008-05-23 08:56:19 EDT
Deferral to future with PMC approval
Comment 8 Kathy Chan CLA 2009-02-23 13:40:03 EST
Mass update of P1 enhancements and defects targeted to future to P2.
Comment 9 Paul Slauenwhite CLA 2009-06-30 06:56:42 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. Since this defect is more than 2 years old, it may be no longer relevant. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this defect is resolved as WONTFIX. If this defect is still relevant and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.
Comment 10 Paul Slauenwhite CLA 2009-06-30 12:12:50 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.