
Bug 201791

Summary: Agent Controller message delivery problem in long runs
Product: z_Archived
Reporter: Kevin Mooney <kmooney>
Component: TPTP
Assignee: Jonathan West <jgwest>
Status: CLOSED WORKSFORME
QA Contact:
Severity: critical
Priority: P1
CC: jkubasta, luc.auvray, sluiman
Version: unspecified
Keywords: plan
Target Milestone: ---
Hardware: PC
OS: Windows XP
Whiteboard: closed460

Attachments:
Description: Hyades Client/Agent Long RunTest
Flags: none

Description Kevin Mooney CLA 2007-08-30 16:33:18 EDT
Build ID: N/A

Steps To Reproduce:
1. Execute a long-running (e.g. 120-hour) RPT playback.


More information:
Throughout an RPT test (or schedule) run, the RPT runner process sends a small batch of messages (5) to the workbench at a fixed interval (every 5 seconds). These messages are sent via
long org.eclipse.hyades.internal.execution.remote.RemoteComponentSkeleton.sendMessageToAttachedClientReturn(String arg0, long arg1)
(or void org.eclipse.hyades.internal.execution.remote.RemoteComponentSkeleton.sendMessageToAttachedClient(String arg0, long arg1)).

After some period of time (typically about 48 hours), these messages are no longer delivered to the workbench.

When using the version of the method that returns the number of bytes written, the return value is always >0, even after the messages are no longer delivered.
Comment 1 Samson Wai CLA 2007-09-11 09:42:13 EDT
Adjust sizing.
Comment 2 Samson Wai CLA 2007-11-27 09:30:37 EST
Hi Bing. I have transferred my bugs to you for triage. Thanks.
Comment 3 jkubasta CLA 2007-11-27 10:16:23 EST
Needed by a consuming product.  Please see if reproducible with 4.5
Comment 4 Jonathan West CLA 2008-01-28 15:16:34 EST
Created attachment 88051 [details]
Hyades Client/Agent Long RunTest

I've written a Hyades client and agent in order to simulate the behaviour described in the bug report. This file is a ZIP containing the code for the test in an Eclipse Project. In order to run it, you'll need to check out org.eclipse.hyades.execution from CVS, and add that as a required project to the LoggingCore project in Eclipse.

Source files of interest are:
- TestAgent.java - Self-contained agent
- TestClient.java - Self-contained client
- LogUtil.java - Log to file utility used by above
- LocalTest.java - Local test of client and agent together.

When run, the agent registers itself with the ACServer and waits for the client to connect. When the client connects, the agent sends messages to the client using 'sendMessageToAttachedClientReturn' at a user defined interval. Both agent and client log the output to their own separate .log files. These log files can then be compared for irregularities using an application like Beyond Compare.
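
The log comparison described above could also be automated rather than done by eye. The following is only a minimal sketch and not part of the attached test: it assumes, hypothetically, that every line in both logs ends with a numeric sequence number (for example "sent 42" in the agent log and "received 42" in the client log). The actual format written by LogUtil.java is not shown in this report, and the class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the real TestAgent/TestClient log format is not shown in
// this report. Assumes each log line ends with a numeric sequence number.
class LogCompare {

    // Extracts the trailing sequence number from a log line, or -1 if there is none.
    static long sequenceOf(String line) {
        String trimmed = line.trim();
        int i = trimmed.length();
        while (i > 0 && Character.isDigit(trimmed.charAt(i - 1))) {
            i--;
        }
        if (i == trimmed.length()) {
            return -1; // no trailing digits
        }
        return Long.parseLong(trimmed.substring(i));
    }

    // Returns the sequence numbers present in the agent log but absent from the
    // client log, in the order the agent logged them.
    static List<Long> missingFromClient(List<String> agentLines, List<String> clientLines) {
        Set<Long> received = new HashSet<>();
        for (String line : clientLines) {
            long seq = sequenceOf(line);
            if (seq >= 0) {
                received.add(seq);
            }
        }
        List<Long> missing = new ArrayList<>();
        for (String line : agentLines) {
            long seq = sequenceOf(line);
            if (seq >= 0 && !received.contains(seq)) {
                missing.add(seq);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> agent = List.of("sent 1", "sent 2", "sent 3", "sent 4");
        List<String> client = List.of("received 1", "received 2", "received 4");
        System.out.println("Missing at client: " + missingFromClient(agent, client));
    }
}
```

A check like this reports exactly which messages the client never logged, which is the irregularity the manual comparison is looking for.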
Comment 5 Jonathan West CLA 2008-01-28 15:25:51 EST
I've run three long run tests, using the above code. All have passed.

First Test:
- Local machine only.
- Started January 9th at 3:13pm. 
- Ended January 14th at 12:18pm
- Ran almost five days. (appx 117 hours)
- Test was 2 messages per second, from agent to client.

- Passed. No missed messages.


Second Test:
- Local machine only.
- Started January 15th, 9:52 AM.
- Ended January 18th, 9:28 AM.
- 12,494,242 messages sent.
- Almost three days (appx 71 hours)
- Test was 50 messages per second.

- Passed. No missed messages.


Third Test:
- Remote test between a Windows 2003 Server (x64 Edition) host and a Windows XP host
- Started January 24 3:09pm
- Ended January 28 12:50pm
- 6,634,454 messages sent.
- Almost four days (appx 94 hours)
- Test was 20 messages per second

- Passed. No missed messages.


All tests using TPTP 4.5.0 i4.
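
As a sanity check on the figures above, the average delivery rate of the second and third tests can be recomputed from the reported start time, end time, and message count. This is a rough sketch only: the year 2008 is an assumption taken from this comment's timestamp (the test entries give only month and day), and the class name is made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Rough sketch: recomputes average message rates from the figures in this comment.
// The year (2008) is assumed from the comment date; it is not stated per test.
class RunRate {

    // Average messages per second over the interval from start to end.
    static double messagesPerSecond(long messages, LocalDateTime start, LocalDateTime end) {
        long seconds = Duration.between(start, end).getSeconds();
        return (double) messages / seconds;
    }

    public static void main(String[] args) {
        // Second test: January 15th 9:52 AM to January 18th 9:28 AM, 12,494,242 messages.
        double second = messagesPerSecond(12_494_242L,
                LocalDateTime.of(2008, 1, 15, 9, 52),
                LocalDateTime.of(2008, 1, 18, 9, 28));

        // Third test: January 24 3:09pm to January 28 12:50pm, 6,634,454 messages.
        double third = messagesPerSecond(6_634_454L,
                LocalDateTime.of(2008, 1, 24, 15, 9),
                LocalDateTime.of(2008, 1, 28, 12, 50));

        System.out.printf("Second test: %.2f msg/s%n", second);
        System.out.printf("Third test:  %.2f msg/s%n", third);
    }
}
```

Both averages come out a little under the nominal 50 and 20 messages per second.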

Comment 6 jkubasta CLA 2008-01-30 10:33:44 EST
Kevin, please reopen this defect if you are able to reproduce it.
Comment 7 Harm Sluiman CLA 2008-01-30 13:15:31 EST
Do we have a test that simply blows the doors off and pushes as much data as possible for an extended period of time? I would expect volumes could be much higher than in the reported problem, but should still work perfectly.

Also, do we know the particular sizes of the payloads in the reported use case? Like 1 meg or a gig of data at a time versus 100 bytes?
Comment 8 Jonathan West CLA 2008-02-13 16:01:47 EST
Hi Kevin, I think I may have found the problem, or at least a contributing factor. 

Quick question: on average, how large is the message you are sending from the agent? That is, the average length of the 'String arg0' value passed into sendMessageToAttachedClientReturn(...). 
Comment 9 Kevin Mooney CLA 2008-02-14 09:05:38 EST
From James Sutton:

Well, I'm thinking that this defect was written before we merged the userstates message and the heartbeat message.

The heartbeat message from runner to workbench was "ACTIVE,uuu,mmmm", where uuu is the number of users (between 1 and 4 or 5 digits) and mmmm is the memory usage (again 1-5 digits). Typically 16 chars or so.
The response message from workbench to runner was "ACTIVE". Always 6 chars.


Additionally, there were userstates messages which were longer in both directions.

Now there are only the userstates messages.
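
For reference, the old heartbeat format described above is small enough to sketch in code. This is an illustration only, not the RPT implementation: the class, field, and method names are hypothetical, and the unit of the memory usage field is not stated in this report.

```java
// Hypothetical sketch of the "ACTIVE,uuu,mmmm" heartbeat described in this comment.
// Names are made up for illustration; this is not the RPT runner's actual code.
class Heartbeat {

    // Reply sent from workbench to runner, as described above (always 6 chars).
    static final String RESPONSE = "ACTIVE";

    final int users;       // number of users (1 to 5 digits in the report)
    final int memoryUsage; // memory usage (1 to 5 digits; unit not stated)

    Heartbeat(int users, int memoryUsage) {
        this.users = users;
        this.memoryUsage = memoryUsage;
    }

    // Builds the runner-to-workbench message, e.g. "ACTIVE,250,1024".
    String format() {
        return "ACTIVE," + users + "," + memoryUsage;
    }

    // Parses a heartbeat message; rejects anything that is not "ACTIVE,<int>,<int>".
    static Heartbeat parse(String message) {
        String[] parts = message.split(",");
        if (parts.length != 3 || !parts[0].equals("ACTIVE")) {
            throw new IllegalArgumentException("Not a heartbeat message: " + message);
        }
        return new Heartbeat(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        String wire = new Heartbeat(250, 1024).format();
        Heartbeat parsed = Heartbeat.parse(wire);
        System.out.println(wire + " has " + wire.length() + " chars, users=" + parsed.users);
    }
}
```

With both fields at their 5-digit maximum, a message in this format is 18 characters long, so the payloads in question are very small.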
Comment 10 Jonathan West CLA 2008-03-31 14:26:54 EDT
The problem may be related to 220968. A memory leak over time in the agent code would eventually exhaust the memory available to the operating system. When the failure is observed in the described scenario, has the java agent (javaw.exe) ballooned to a much larger size than would be expected?
Comment 11 Kevin Mooney CLA 2008-03-31 15:31:41 EDT
(In reply to comment #10)
> The problem may be related to 220968. A memory leak over time in the agent code
> would eventually exhaust the memory available to the operating system. When the
> failure is observed in the described scenario, has the java agent (javaw.exe)
> ballooned to a much larger size than would be expected?
> 

Unfortunately we don't have any information about the size of the java agent, though typically I would look for the existence of the process and, if it were large, would have noted it. Perhaps the problem was fixed indirectly, or by our reducing the message frequency; we now have runs that last days.
Comment 12 Paul Slauenwhite CLA 2009-06-30 12:09:56 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.