We've been getting a number of reliability defects filed during our consuming product's system test using the new AC. I have been running this down for several days, and the good news is that I now have a readily reproducible test case showing the type of failure we've been encountering. Hopefully it's the same problem. I have no reason to assume that this is caused by the compatibility layer -- I only mention that we are connecting to the new AC using the org.eclipse.hyades.execution.local.NodeImpl.connect() method, which uses the compatibility layer.

Attached is a plugin that you can drop into a TPTP workbench. It adds a new menu called ACDiag with a menu item called Session Connect. (There's also a button with the eclipse icon that does the same thing.) Choosing this action, you will be prompted for the hostname to connect to, the number of times you want to connect & release, and the number of milliseconds to sleep between connect attempts. The connections are done in series, with a sleep() call of the specified length between each release and the subsequent connect call (unless you specify 0 for the pause time, in which case no sleep call is made). The plugin opens a console window (called AC Session Console) to print its output. It prints each successful connect & release call, and if an exception is thrown, it prints the stack trace.

Running this plugin against the new agent controller, either locally or remotely, and taking the default values for iterations (20) and pause time (0), we consistently fail to connect somewhere between the 3rd and 13th connection (in my testing). Once we fail to connect the first time, all subsequent connection attempts also fail, and the AC appears to stop responding to other requests as well (for example, calls to org.eclipse.hyades.internal.execution.local.control.NodeImpl.getPropertyValues(String, String) return null, presumably because they are eating the exception on line 428). Recycling the workbench instance does not return the Agent Controller to a good state -- you have to restart it for it to be usable again.

I find different things logged in the serviceconfig.log file when these failures happen. Sometimes it appears that the AC stops writing to the log during this failure situation. Other times, I see any of the following suspicious entries:

1) PC handleProcessEvents: Process has exited with status 259.
2) Error writing to socket while sending message to client
3) Received(174 bytes). Validated message of 174 bytes
   Try count XXX and connect result - -1
   Ready to accept next named pipe request.
   (Where XXX is an increasing number, and the entry appears over & over again.)

Running this same test with the same options against the old Windows agent controller (locally or remotely) succeeds for a large number of connections (I tested 200). If you increase the wait time to 3000 milliseconds, the new AC succeeds as well, which is an interesting data point. The test plugin includes its source code so you can see exactly what it's doing.
Created attachment 51813 [details] Plugin that demonstrates the defect
Found in 4.2.1 and reproducible in 4.3
*** Bug 156713 has been marked as a duplicate of this bug. ***
Thanks Joe for going the extra mile to come up with a reproducible use case of this problem. This is something that we should be able to automate using Navid's automation framework. I'm CCing Navid and Thay for them to consider creating test cases that will stress test the Agent Controller.
Yes, the automation framework can easily be used to automate such a scenario. I wish you had contacted me before you created the ACDiag menu. Thay is the new owner of the automation framework so I'll leave it up to him to follow up on this. Thay, let me know if you have any questions.
I don't know anything about how to use or add to the automation framework, but I think I sent a request several weeks ago to have a demo/presentation of this at the AG call. This is the type of thing we need to socialize to get people contributing to the automation bucket, and I'd like to know more about what it can do and how I can use it. In terms of this defect, I spent about 3 days chasing down the problem itself, and probably 20 minutes writing the plugin code for the ACDiag menu. It was really just a way for me to repeatedly and quickly run some test code in a runtime workbench with varying inputs, and it should prove useful in understanding the root cause of the problem (like why does it fail less frequently when we sleep longer between connect attempts?) I definitely agree that the logical next step is to add the test logic to an automated test with some known interesting inputs.
Joe, the presentation you requested in the AG call was for a different feature. That was for integrating our automated test cases with the builds. That feature is still in progress. At the point when you requested the demo, we had the installation and setup of the AC automated. We're planning to do a presentation once the feature is completed. The automation framework I'm referring to in comment #5 is the framework that runs the automated test cases for AC. Its introductory guide is in CVS: test-results/platform/org.eclipse.tptp.ac.testautomation/automation-files/notes/ACTestAutomation.doc
I can reproduce this consistently with Joe's plugin. It turns out it's not really a connection problem. The connections are being established, but the first command the client tries to send (once we're in the error state) causes the connection's command processing thread to hang, which apparently looks to the client like the connection failed. I think I've tracked it down to a situation where the client is requesting a process to be launched, and we're never coming out of the startProcess command in the compatibility layer, and therefore never releasing the processList mutex, which all subsequent command handlers try to acquire. I hope to have a fix for this soon.
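In sketch form, the suspected problem is a classic hold-and-wait; the names below are illustrative placeholders in the style of the compatibility-layer code, not the actual identifiers:

enterMutex( processList->Lock );         /* startProcess takes the process list mutex      */
launchProcess( proc );                   /* sends a named pipe command                     */
waitForLaunchComplete( proc );           /* never returns once we're in the error state    */
exitMutex( processList->Lock );          /* so this is never reached, and every other      */
                                         /* command handler blocks on the mutex            */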
I managed to trace this to a condition where one thread in the compatibility layer acquired the process list lock and wasn't going to release it until the process launch was complete. Meanwhile, another module sent a named pipe message to notify us that a previous process had exited. The process exited event handler blocked trying to acquire the process list lock, and because this handler blocked, the named pipe transport layer was blocked from processing further messages. Unfortunately, the thread trying to launch the process was trying to send a named pipe command to set up the console for the process to be launched. So we have three mutually dependent threads deadlocked.

I have a fix for this wherein the thread handling the process launch command releases the processList lock while it is waiting for notification that the launch completed. Ideally, the compatibility layer would only use the processList lock in the very narrow bits of code where it is actually modifying the process list, but this lock is being used more broadly than that now (in a way copied from the RAC) and we don't want to risk destabilizing the whole compatibility layer at this stage.

With the above fix in place, things work much better. Unfortunately, things still aren't perfect. Without the fix I was generally seeing failure within 5 iterations, and once it started failing, you had to kill the AC, as Joe said. With the fix, I can often complete 20 or more iterations of the test plugin, but I am still seeing occasional failures. With these new failures, I can often get more successes just by restarting the test -- the old job is hung, but the new job starts working. At this time I have no idea what the cause of these new failures might be.
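In the same sketch notation as above (placeholder names, not the real code), the fix amounts to releasing the lock before the wait:

enterMutex( processList->Lock );
proc = addProcessEntry( processList );   /* touch the list only while holding the lock     */
exitMutex( processList->Lock );          /* release before waiting                         */

launchProcess( proc );                   /* named pipe command to set up the console       */
waitForLaunchComplete( proc );           /* the process-exited handler can now take the lock */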
I agree that we should eliminate any "hold and wait" scenarios that we can find. They're often deadlocks waiting to happen, as you point out. Do you feel confident that this problem is confined to the compatibility layer (i.e. the same type of test connecting via the new APIs can run indefinitely without experiencing this type of failure)? If so, I think we should spend the effort required to review the synchronization in the compatibility layer and try to ferret out the remaining issues.

I guess I'm also somewhat troubled to learn that the named pipe transport layer allows a handler to block while processing a given message (if I understood your comment correctly). I might be naive in my thinking (or more likely just ignorant), but I'd expect the semantics of the transport layer would ensure delivery of a message but not processing of it (and any messages requiring notification of completion would do so with a specific return message once the processing was complete). If handlers can block and prevent the transport layer from delivering future messages, it seems like this type of problem could happen in other areas as well. I'll admit it's been a while since I reviewed the new AC design -- did I misunderstand your description?
Unfortunately, the compatibility layer (and the RAC) very frequently holds the processList lock longer than it needs to. It would be easy enough to clean this up, but it would change the semantics of how it works and could introduce additional problems. For instance, the current implementation has the effect of making processing of incoming messages roughly atomic (at least with respect to additional messages coming in the same direction). If we change the code to only hold the lock when it needs it, it might just work, or it might introduce a flurry of hard-to-reproduce timing bugs. There are a couple of other hold-and-wait situations, which are definitely problems waiting to happen. Changing them the way I've changed the one isolated above would reduce this risk (or at least change it to a different risk), but it shouldn't be confused with a correct fix. Frankly, I'm not sure which way to go with this one.

On the other topic, the named pipe handler is in a loop doing this (see the sketch at the end of this comment):

    while (running)
        connect to pipe
        read message(s)
        process message(s)
        disconnect from pipe

Obviously this will keep additional messages from being read while the current message is being processed. In a perfect world, perhaps we'd be processing messages with each message in a different thread. That would keep any one message handler from blocking the whole works, but the cost of thread startup would likely be prohibitive. We could implement a thread pool that would approximate the one-thread-per-message scenario, but it would be a fairly significant change to the implementation. I think something like this happens at each message handling step, regardless of the transport layer.
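For readers less familiar with this transport, a minimal Win32-style sketch of that loop follows. The pipe name, buffer size, and handleMessage() are assumptions for illustration, not the actual AC implementation:

#include <windows.h>

/* Single-threaded named pipe handler loop (sketch). Everything here is
 * illustrative; handleMessage() stands in for the real command handlers. */
extern void handleMessage(const char *buf, DWORD len);   /* may block -- which is the problem */

void pipeHandlerLoop(volatile int *running)
{
    HANDLE pipe = CreateNamedPipeA("\\\\.\\pipe\\example_ac_pipe",
                                   PIPE_ACCESS_DUPLEX,
                                   PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
                                   1, 4096, 4096, 0, NULL);
    if (pipe == INVALID_HANDLE_VALUE)
        return;

    char  buf[4096];
    DWORD bytesRead;

    while (*running) {
        if (!ConnectNamedPipe(pipe, NULL) && GetLastError() != ERROR_PIPE_CONNECTED)
            continue;
        while (ReadFile(pipe, buf, sizeof(buf), &bytesRead, NULL) && bytesRead > 0)
            handleMessage(buf, bytesRead);   /* a blocked handler stalls the whole loop */
        DisconnectNamedPipe(pipe);
    }
    CloseHandle(pipe);
}

A thread pool, as suggested above, would hand each message off to a worker instead of calling handleMessage() inline, so one slow handler could no longer stall message delivery.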
I've submitted the following request for creating stress test cases for the Agent Controller: https://bugs.eclipse.org/bugs/show_bug.cgi?id=160940 I believe we can avoid problems such as this if we properly stress test the Agent Controller. I've marked the severity of the above defect as an enhancement; please let me know if you think it should have a higher severity.
I agree with your concern about changing the semantics of commands through the compatibility layer -- we don't want to make matters worse. For this defect, I suggest that when i3 opens, you make the fix you have in hand (which is clearly the most serious part of the problem) and then try to track down the less frequent failure specifically (rather than overhauling the code at this point.) I'm afraid this will continue to be a problem for TPTP and its consuming products, though a less frequent one. I think we need to get to the bottom of this issue and know that we're as stable with the compatibility layer as we are with the old AC. Thanks for digging in on this one.

On the other topic, your suggestion of maintaining a thread pool to process messages sounds like a good one. Based on your pseudo-code, I'd infer that the semantics of the command from the sender's perspective would not be altered if you made this type of enhancement in the future (i.e. you don't send any sort of ack after the process message(s) call before the pipe disconnect.) The fact that you apparently can read and process many messages at one time while blocking the transport seems to argue for this type of enhancement as well. Am I correct to infer that only AC internal code can register as handlers in this fashion (i.e. I couldn't create my own agent and write a handler that gets this type of access)? I think this is a natural part of the evolution of a subsystem -- if it turns out to be an issue, we can file the enhancement request and plan the work accordingly.

This second part of the discussion is probably not related to this defect and can be moved onto the mailing list or a design call if people prefer. Thanks again, Andy.
Are there any updates on this defect? Any patches available for us to try?
I thought I had a major improvement, but it turns out the problem is just sensitive to how much activity is going on while the test is running. Even with my proposed "fix" as described above, I'm seeing a lot of failures. I'm trying a new approach, preventing the message delivery mechanism from blocking. It has introduced some new snags, which I'm working through now.
Created attachment 52868 [details] Proposed patch for unreliable connection fix
This was an ugly one to track down. It turns out that there were (at least) two problems.

The first problem was that the client compatibility code was holding a write lock on the process list while waiting for notification that a new process had been launched. If, while it was waiting, a previous process terminated, the notification of that termination got blocked trying to get the process list lock, and this blocked the named pipe mechanism, so the process launch notification never came through and the process list lock never got released. Deadlock all around. That one was comparatively easy to find.

The second problem was that the named pipe mechanism would occasionally get blocked without an obvious cause. One module was blocked waiting to open the named pipe, and another was blocked waiting for someone to write to the named pipe, but it didn't look like anyone actually had it open for writing. It turned out that when we were creating and opening the named pipe, we were setting the flag to have the handles be inherited by child processes (this is needed for console support). If we launched a process in one thread while a second (unrelated) thread had the named pipe open for writing, the child process inherited the open handle, and the writing thread closing it didn't free the pipe. To fix this, I added a parameter to the function that opens the named pipe so that only console pipes get the inherit flag set.

The proposed fix is attached as a patch file.
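As a rough illustration of that last change: the pipe-opening helper now takes a flag so handle inheritance is opt-in. The function name and surrounding details below are assumptions, not the actual patch:

#include <windows.h>

/* Sketch only: openClientPipe is a placeholder name. The point is that
 * handle inheritance (bInheritHandle) is now opt-in, so only console
 * pipes can leak their handles into launched child processes. */
HANDLE openClientPipe(const char *pipeName, BOOL inheritable)
{
    SECURITY_ATTRIBUTES sa;
    sa.nLength              = sizeof(sa);
    sa.lpSecurityDescriptor = NULL;
    sa.bInheritHandle       = inheritable;   /* TRUE only for console pipes */

    return CreateFileA(pipeName,
                       GENERIC_READ | GENERIC_WRITE,
                       0,                    /* no sharing */
                       &sa,                  /* inheritance is controlled here */
                       OPEN_EXISTING,
                       0,
                       NULL);
}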
This is great news. Can you make the patch available in binary form so that we can try it out?
We are having a problem testing this fix. The AC only allows one playback and then seems to stop responding to commands. We have to stop and restart it to do a subsequent playback. The behavior has been reproduced using URL Test.
I found a third problem where in two places we had code that did something like this:

    process = findProcess( processList );
    ...do some things...
    enterMutex( processList->Lock );
    ...do some other things...
    exitMutex( processList->Lock );

The problem was that the process pointer we obtained outside the lock wasn't always valid by the time we got inside the lock (i.e. whoever held the lock deleted the process). Simply moving the findProcess call inside the lock cleared up the problem (see the corrected sketch below). With this fix, I can run the sample plugin for as many as 10000 iterations, by which time it seems to hit memory leak related problems.
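For clarity, the corrected shape looks roughly like this (a sketch of the pattern, not the literal patched code):

enterMutex( processList->Lock );
process = findProcess( processList );    /* look the process up while holding the lock */
if (process != NULL) {
    /* ...do the things that previously ran outside the lock... */
}
exitMutex( processList->Lock );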
Great news. I guess that means that with these fixes, we can use the plugin to investigate memory leaks that still exist? Thanks Andy!
Re-opening this defect. Although things are greatly improved, we still have reliability problems running base TPTP tests with the backward compatibility layer on top of the new AC. (Our consuming products have the same issues as well.) I am attaching a new plugin to demonstrate this failure (which I also attached to 159883). To reproduce:

1) Unzip the attached zip file, and drop the TPTPStress plugin into your eclipse plugins directory.
2) Run eclipse and import the project in the JUnitProject.zip into your workspace.
3) Open the project in the package explorer or Resource Perspective navigator, open the src folder, then open the junitproject folder.
4) Select JUnit1.testsuite and then click on the ACDiag->TPTP Stress menu. It will prompt you for how many times you want to execute the selected test(s).

If you run the test 50 times, it will usually fail (hang during test execution or launch). These failures are accompanied by the following message in the log file:

    Agent Controller allowed multiple clients to attach to a legacy agent.

Subsequent launch attempts can succeed, but once you have hit this error once, it reappears very regularly. (i.e. it may take 30-60 times to hit the error the first time, but you will subsequently hit it every 2-5 times.)
Created attachment 53419 [details] AC log files showing the failures (search for "legacy agent")
Created attachment 53420 [details] Plugin demonstrating the failure
Joe, I'm not sure I know how to follow the test procedure. I see a TPTPStress.jar file in the attached zip file, but not a TPTPStress plugin. How do I get this in Eclipse as a plugin?
Hi Andy, The TPTPStress.jar file is a plugin. Just drop that jar file into your eclipse plugins directory and you should be good to go. (Many but not all eclipse plugins ship as jar files now.) Thanks, --Joe
OK, now I've got the plugin in place and imported the JUnitTest project, but when I run the TPTP Stress command, it doesn't seem to be connecting to my AC. It produces output saying that each iteration of the test completed, but it's not connecting to the AC socket. If I go to the Test perspective and run the JUnit test from there, it connects to the AC. I tried disabling the IAC. Any additional ideas?
Hi Andy, I just tried this, and I see the problem. I'm not sure why, but the plugin only seems to work from a runtime workbench. I'll look into that, but in the meantime, you can get around this by launching a runtime workbench from your current eclipse instance (where you have the plugin installed) and importing the JUnitProject into that new workspace. Then try again, and it should work. (The problem is not the IAC. For some reason, the test is not being executed at all.)
Created attachment 53515 [details] Updated test plugin This updated plugin doesn't require running from the runtime workbench. I need to file a defect for the reason behind this change (I found a bug on the ASF side!)
Andy has a fix which he has sent to Joe for testing.
The latest problem in this chain is that the Agent compatibility layer and the client compatibility layer maintain separate lists of processes and agents, and while the client layer was cleaning its list, the agent layer wasn't. Given enough identical iterations, we eventually hit a case where Windows recycled a process ID that we had seen before, and we mapped it to the old data. I made some changes to try to address this, which is what I sent to Joe before. That, unfortunately, introduced some new problems. I think I have those fixed now, and my latest version, which I'll be attaching following this comment, should be much better.
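A minimal sketch of the kind of cleanup this implies; the type and helper names here (ProcessList, findProcessByPid, removeProcessEntry) are hypothetical, not the actual compatibility-layer code or the actual fix:

typedef struct Process     Process;
typedef struct ProcessList ProcessList;

extern void     enterMutex(void *lock);
extern void     exitMutex(void *lock);
extern Process *findProcessByPid(ProcessList *list, unsigned long pid);
extern void     removeProcessEntry(ProcessList *list, Process *proc);

struct ProcessList { void *Lock; /* ... */ };

/* When the agent side sees a process-exited notification, purge its entry
 * keyed by PID so a recycled Windows PID can't be matched to stale data. */
void onProcessExited(ProcessList *agentProcessList, unsigned long pid)
{
    enterMutex(agentProcessList->Lock);
    Process *proc = findProcessByPid(agentProcessList, pid);
    if (proc != NULL)
        removeProcessEntry(agentProcessList, proc);
    exitMutex(agentProcessList->Lock);
}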
Created attachment 53595 [details] Zipped DLLs to address the problems uncovered by Joe's stress plugin Unzip these files and overwrite the corresponding DLLs in the latest 4.3 build to test this.
Fixes have been checked in.
In addition to patching the leaks, this fix included resetting the default shared memory buffer allocation to match that used by the RAC. Now, if an agent does not specify dataChannelSize in its pluginconfig.xml, 1 MB is allocated rather than 64 MB. This change affects backwards compatibility mode agents only. (New AC API agents have a default of 8 MB.) The 64 MB setting was a leftover in compatibility code from before dataChannelSize settings were read from the agent's config, as the piAgent required that size.
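The default selection amounts to something like the sketch below. The function and parameter names are hypothetical; only the sizes come from the comment above:

#include <stddef.h>

enum {
    COMPAT_DEFAULT_CHANNEL_SIZE  = 1 * 1024 * 1024,  /* 1 MB: backwards compatibility agents */
    NEW_API_DEFAULT_CHANNEL_SIZE = 8 * 1024 * 1024   /* 8 MB: new AC API agents */
};

/* Pick the shared memory data channel size for an agent.
 * configuredSize is 0 when pluginconfig.xml has no dataChannelSize. */
size_t effectiveDataChannelSize(size_t configuredSize, int isCompatibilityAgent)
{
    if (configuredSize != 0)
        return configuredSize;
    return isCompatibilityAgent ? COMPAT_DEFAULT_CHANNEL_SIZE
                                : NEW_API_DEFAULT_CHANNEL_SIZE;
}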
Was the fix for this defect also submitted in 4.2.1? I'm experiencing a similar problem when testing the 4.2.1.1 driver.
Yes, these changes are checked in to the 4.2.1.1 branch.
As of TPTP 4.6.0, TPTP is in maintenance mode and focusing on improving quality by resolving relevant enhancements/defects and increasing test coverage through test creation, automation, Build Verification Tests (BVTs), and expanded run-time execution. As part of the TPTP Bugzilla housecleaning process (see http://wiki.eclipse.org/Bugzilla_Housecleaning_Processes), this enhancement/defect is verified/closed by the Project Lead since this enhancement/defect has been resolved and unverified for more than 1 year and considered to be fixed. If this enhancement/defect is still unresolved and reproducible in the latest TPTP release (http://www.eclipse.org/tptp/home/downloads/), please re-open.