Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 232394 - AgentController crashes when set to DEBUG log level
Summary: AgentController crashes when set to DEBUG log level
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: TPTP (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows XP
: P1 critical (vote)
Target Milestone: ---   Edit
Assignee: Stanislav Polevic CLA
QA Contact:
URL:
Whiteboard: closed460
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-15 15:31 EDT by Joe Toomey CLA
Modified: 2016-05-05 10:52 EDT (History)
10 users (show)

See Also:


Attachments
Project containing a simple TPTP JUnit test (5.65 KB, application/zip)
2008-05-20 15:18 EDT, Joe Toomey CLA
no flags Details
TPTP Test Stress plugin (14.60 KB, application/java-archive)
2008-05-20 15:21 EDT, Joe Toomey CLA
no flags Details
Fixes the problem in CBE library (14.90 KB, patch)
2008-05-27 12:29 EDT, Stanislav Polevic CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joe Toomey CLA 2008-05-15 15:31:49 EDT
In testing with the current 4.5 Agent Controller (TPTP-4.5.0-200805061330), if you edit the serviceconfig.xml file to turn log level up to DEBUG on Windows, the Agent Controller crashes after very little use.  In my testing (which involved repeatedly running TPTP tests), the AC consistently crashed in less than 10 test executions.  When the log setting is left at the default (INFORMATION), I was able to run ~350 times before encountering a failure.  More details on that failure will be forthcoming.

This seems like a critical error to me, because we need to turn up the log level in order to diagnose other problems with the AC.  If we can't get more log detail, we're debugging with our eyes closed.
Comment 1 Jonathan West CLA 2008-05-15 18:28:39 EDT
Stanislav, I think this is likely due to changes you made to the NamedPipeListener.c file that were contributed by patches by 201412 and a few subsequent patches.

The only additional logging messages added to the agent controller between i6 and i7 are all of the following 'TPTP_LOG_DEBUG_MSG' lines in NamedPipeListener.c:

else if (rc == TPTP_PIPE_HAS_ENDED)
{
	/* this is expected and normal */
	TPTP_LOG_DEBUG_MSG(pThreadData->pRequestBlk->pServerData, "Console: End of pipe recieved - exiting");
	pThreadData->threadStatus = TPTP_IDLE_THREAD;
}
				}

And:

	TPTP_LOG_DEBUG_MSG(pThreadData->pRequestBlk->pServerData, "Console: Closing named pipes") ;
    
	rc = disconnectFromNamedPipe(consoleHandle);
	TPTP_LOG_DEBUG_MSG1(pThreadData->pRequestBlk->pServerData, "Disconnect from named pipe %d", rc);
	cleanPipeUp(&consoleHandle);

	disconnectFromNamedPipe(pRequestDataBlock->consoleInputPipe);
	cleanPipeUp(&pRequestDataBlock->consoleInputPipe);

	tptp_deleteSemaphore(&pRequestDataBlock->hasProcessIdSemaphore);

	TPTP_LOG_DEBUG_MSG(pThreadData->pRequestBlk->pServerData, "Stdout/stderr listening thread exited");


Can you take a look at this bug, Stanislav? Thanks!
Comment 2 Stanislav Polevic CLA 2008-05-16 05:34:55 EDT
Jonathan, could you provide more details on the test case or recommend similar test case?
Comment 3 Joe Toomey CLA 2008-05-20 15:16:56 EDT
Hi Stanislav,

It is fairly easy to reproduce this problem.  Just configure the stand alone Agent Controller to output DEBUG level logging, and then run a TPTP test (any TPTP test) repeatedly.  In my testing, this crashes the AC consistently before you have reached 20 test runs (where we're hoping for reliability on the order of thousands of launches.)

Here are some more detailed instructions:

1) Using the stand alone Agent Controller, edit your serviceconfig.xml file and change the following line:

	<Logging directory="%TPTP_AC_HOME%\config" format="CBE" level="INFORMATION"/>

to 

	<Logging directory="%TPTP_AC_HOME%\config" format="CBE" level="DEBUG"/>

2) Restart the AC

3) Import the attached project into your TPTP workspace.  (I have created a project with a TPTP JUnit test for you to run.)

4) Run the test from that project (ACCrash/src/JUnit1.testsuite) by right clicking and selecting Run -> Test.

5) Goto 4 until it crashes.

If this becomes tedious, I will also attach a plugin that you can install into your workbench to repeatedly run selected tests.  In order to use it, install the plugin into your workbench (put it in the eclipse\dropins folder and restart the workbench.)  Then select the test (ACCrash/src/JUnit1.test suite) in the Java perspective package explorer and choose the ACDiag -> TPTP Test Stress menu item.  It will ask how many times to run the test -- the default of 50 always crashes the AC for me (when configured with DEBUG log level) long before 50 executions have been completed.

Thanks,
--Joe
Comment 4 Joe Toomey CLA 2008-05-20 15:18:03 EDT
Created attachment 101120 [details]
Project containing a simple TPTP JUnit test

Project containing a simple TPTP JUnit test.  Import this into your workspace and run this test with DEBUG log level set for the AC to see the crash.
Comment 5 Joe Toomey CLA 2008-05-20 15:21:46 EDT
Created attachment 101122 [details]
TPTP Test Stress plugin

Utility plug-in to repeatedly execute selected test(s).  Install as any eclipse plugin and use menu item ACDiag -> TPTP Test Stress to repeatedly invoke tests.
Comment 6 Jonathan West CLA 2008-05-20 20:38:31 EDT
Thanks Joe!

Stanislav, since the bug was reported, I've had some time to investigate. The problem occurs in agent controller builds previous to i7, so it's not likely related to the changes mentioned in comment #1. Can you see if you can locate the problem? I think it may be in the SocketTL code, but see what you can find. Thanks!
Comment 7 Jonathan West CLA 2008-05-20 22:15:52 EDT
The plugin is a .JAR file (com.ibm.tptp.internal.stress_1.1.0.jar); to install it you can just place it into the 'eclipse\dropins\eclipse\plugins' dir. Plugin requires Java 1.6. In order to use it, once it's installed, move into the 'Resource' perspective and select a .testsuite file, then enter the number of iterations. The crash generally occurs between 10-20 iterations. Finally, I'd recommend turning off IAC in the workbench, as this tends to confuse the results somewhat as to when the standalone AC crashes.
Comment 8 Stanislav Polevic CLA 2008-05-22 09:16:22 EDT
Joe and Jonathan, thanks for comprehensive instructions.
I've managed to reproduce it on latest development build and currently I'm busy reproducing it on debug version of AC.
Comment 9 Stanislav Polevic CLA 2008-05-22 09:25:50 EDT
There is no crush in debug version of AC, however, after 5 runs workbench cannot find the corresponding agent:
"Test 'src/JUnit1.testsuite' started.
Preparing launch delegate...
Performing pre-launch check...
Performing final launch validation...
Initializing source locator...
Launching delegate...
Waiting....
Test result 'JUnit1.execution' collected.
Test 'src/JUnit1.testsuite' completed.
Test execution completed!
Test execution started.
Test 'src/JUnit1.testsuite' started.
Preparing launch delegate...
Performing pre-launch check...
Performing final launch validation...
Initializing source locator...
Launching delegate...
org.eclipse.hyades.execution.harness.TestExecutionHarnessExecutorStub$1@c52900 N
o such agent
Waiting.........................................................................
"
Comment 10 Jonathan West CLA 2008-05-22 09:35:27 EDT
Hi Stanislav, one suggestion, if you use a non-debug version of the hcjbnd(.so/dll) file, you may have better luck.
Comment 11 Joe Toomey CLA 2008-05-22 09:51:41 EDT
On one crash occasion, I was able to convince devenv.exe to Just In Time debug the AC, and although I did not have debug symbols at the time, I did get a stack trace that implies the crash is caused by code in the TPTPLogUtils.dll.  So you could try using only that debug dll (with release build for the rest of the AC) and see if that sheds any light.

Comment 12 Stanislav Polevic CLA 2008-05-23 11:34:14 EDT
I see a lot of java processes, they seem to be spawned by AC.
These processes are active and eat a lot of CPU time.
I believe that this happens because of the test invocation process - AC starts java with starter written in Java and starter, in turn, invokes another java process with real test. After that starter terminates (I think) and invoked java process becomes zombie.
Comment 13 Stanislav Polevic CLA 2008-05-27 12:29:42 EDT
Created attachment 102184 [details]
Fixes the problem in CBE library

This patch is for review only!

Workbench was able to run 50 tests. I did not check more since it always failed after 5th run before.

Jonathan, please, review and check it.
Comment 14 Jonathan West CLA 2008-05-27 14:10:54 EDT
Stanislav, what is the particular problem with the CBE library that this patch is fixing? And do you know what statements in the agent controller are causing the problem?
Comment 15 Stanislav Polevic CLA 2008-05-28 05:27:15 EDT
TPTPCBE.cpp:180
sourceComponentIdentification->fpRoot->setLocationType(sourceComponentIdentification, "IPV4");
which is called from tptp_createLogCBE(...);


The problem is that sometimes fpRoot is freed by another copy of java process.

I think the practice of _dynamically_ reallocating and freeing the function pointer table with _static_ content is irrelevant.
Comment 16 Jonathan West CLA 2008-05-28 22:05:22 EDT
Hi Stanislav, I don't presently know enough of the CBE library in order to confirm whether or not your patch is correct, and I don't think anyone has touched that code for more than a year and a half.  Is there a way to patch the agent controller, rather than the CBE, to fix the bug? Do we know why it's occurring only in DEBUG mode, and not in every mode? Is there a particular debug statement (or set of statements) that's causing the problem?
Comment 17 Stanislav Polevic CLA 2008-05-29 08:30:48 EDT
(In reply to comment #16)
> Hi Stanislav, I don't presently know enough of the CBE library in order to
> confirm whether or not your patch is correct, and I don't think anyone has
> touched that code for more than a year and a half.
First of all, I ran the test over 300 times and it did not fail... The problem seems to be fixed... You may have another tests to ensure the patch is correct.

>  Is there a way to patch the
> agent controller, rather than the CBE, to fix the bug?
The core problem of the CBE code provides a pointer table to the _static_ set of internal functions to the outside code (AC) and it frees that pointer when its internal counter is less or equal to zero. And sometimes it frees the pointer which is used by another thread.
Currently, I see a only way to fix the problem - synchronize tptp_createLogCBE function, but it will decrease the performance.

> Do we know why it's
> occurring only in DEBUG mode, and not in every mode? 
In debug mode we have increased number of messages and, thus, better chance to reproduce this condition race.

>Is there a particular
> debug statement (or set of statements) that's causing the problem?
> 
No, this is a floating bug.
Comment 18 jkubasta CLA 2008-05-29 09:15:32 EDT
Marius, any chance you could provide us with some guidance on this defect please?
Comment 19 jkubasta CLA 2008-05-30 14:54:25 EDT
Paul, Joe, are either of you familiar with this CBE code?
Comment 20 Paul Slauenwhite CLA 2008-05-30 15:03:06 EDT
(In reply to comment #19)
> Paul, Joe, are either of you familiar with this CBE code?
> 

I spoke with Joe concerning the patch and from a CBE perspective, it is fine.  However, I am not familiar enough with C to deem the patch stable.
Comment 21 Joe Toomey CLA 2008-05-30 15:32:39 EDT
I'm unfamiliar with the code base, but I know C, and from a language standpoint, I don't have a problem with Stanislav's fix.  It's not the only way to fix it, and it would be nice to understand why the original author chose to maintain a dynamically allocated function pointer table (and continue allocating and freeing it on demand based on ref counts).  Assuming the dll/so containing the functions does not change during the lifecycle of the agent controller, this fix seems safe to me.

Do we have anyone who was involved with this code to comment on why it was implemented dynamically in the first place?  I can't imagine it was to save memory, as the memory consumed by the table is negligible...
Comment 22 Harm Sluiman CLA 2008-05-30 16:10:07 EDT
This code originated with Richard Duggan and then Bob Duncan and Hendra Suwanda, all who have long left the project. 
At the time, the use of the C CBE code was expected to be broad, and although the footprint in a CBE is not such a big deal, in some systems the volume of CBE was expected to be high, so little things add up quickly. This was a focus for the C and Java and other CBE implementations. So I fully expect the implementation was in fact worried about the footprint as well as the public review it got as general open source tends to get.
However CBE became WEF and this implementation has seen it's peek usage pass, so such things are not a worry. CBE in general is no longer under great demand.
I had a look at the code as well and it seems fine to me.
Comment 23 Joe Toomey CLA 2008-05-30 20:51:02 EDT
Based on Harm's feedback, I suggest we proceed with this fix.  Note that based on Stanislav's characterization, this problem is not limited to DEBUG logging level, only more likely to occur due to the increased frequency of log events.  This could contribute to other observed AC instability as well.
Comment 24 Paul Slauenwhite CLA 2008-05-30 20:54:45 EDT
(In reply to comment #23)
> Based on Harm's feedback, I suggest we proceed with this fix.  Note that based
> on Stanislav's characterization, this problem is not limited to DEBUG logging
> level, only more likely to occur due to the increased frequency of log events. 
> This could contribute to other observed AC instability as well.
> 

I agree.  This capability is key to our serviceability.
Comment 25 jkubasta CLA 2008-05-30 23:09:21 EDT
patch committed to Head with PMC approval - copyrights updated on check in
Comment 26 Joel Cayne CLA 2008-06-02 16:50:44 EDT
Kiryl,

Please verify that the AC build has picked up the changed CBE plug-ins.

Thanks
Comment 27 Kiryl Kazakevich CLA 2008-06-04 10:50:39 EDT
Verified with TPTP-4.5.0RC3.
Comment 28 Paul Slauenwhite CLA 2009-06-30 12:12:24 EDT
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.