| Summary: | Possible thread leak in remoteservices | | |
|---|---|---|---|
| Product: | [RT] ECF | Reporter: | Bryan Hunt <bhunt> |
| Component: | ecf.remoteservices | Assignee: | ecf.core-inbox <ecf.core-inbox> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | bugs.eclipse.org, slewis |
| Version: | 3.3.0 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Mac OS X - Carbon (unsup.) | | |
| Whiteboard: | | | |
| Attachments: | | | |

Description
Bryan Hunt
Created attachment 181189: console output
The YourKit snapshot is too large to attach. Ping me on Skype if you want it.

(In reply to comment #0)
> I suspect that something in the network is contributing to the problem. I've
> been having intermittent problems with the network on my workstation and this
> box is where I've been testing the system described above. I have seen cases
> where a ping from my home machine to my workstation fails for the first few
> pings and then starts working again. IT is looking into the issue, and I
> suspect that some of these problems will go away when they fix the network
> problem. Even so, ECF and zookeeper should be tolerant of network issues.

If the network does indeed have intermittent problems, I'm inclined to argue that Zookeeper is behaving correctly. From its point of view the service went down with the network and came back again afterwards. It is just that remoteservices does not dispose of the lingering connection. Thanks for the report.

(In reply to comment #3)
> (In reply to comment #0)
> > I suspect that something in the network is contributing to the problem. I've
> > been having intermittent problems with the network on my workstation and this
> > box is where I've been testing the system described above. I have seen cases
> > where a ping from my home machine to my workstation fails for the first few
> > pings and then starts working again. IT is looking into the issue, and I
> > suspect that some of these problems will go away when they fix the network
> > problem. Even so, ECF and zookeeper should be tolerant of network issues.
>
> If the network does indeed have intermittent problems, I'm inclined to argue
> that Zookeeper is behaving correctly. From its point of view the service went
> down with the network and came back again afterwards. It is just that
> remoteservices does not dispose of the lingering connection.

It may be true that remoteservices is not disposing of the lingering connection...under all necessary failure conditions...but there could still be problems with zookeeper's handling of undiscovery...i.e. via network failure/dropped connections.

I think this bug highlights our ongoing need to put together some sort of distributed test harness...so that network failure situations (for both discovery and remote services) can be explicitly tested. The hard part about diagnosing and fixing these sorts of situations will be simply reproducing both the general network environment and the specific failure conditions that Bryan is seeing...and then finding and fixing any problems in remote services and/or discovery.

I intend to investigate this as much as possible...but Bryan, in the short term it may require some close interaction with you...i.e. to try to reproduce the situations that are leading to these problems in your network and application. Hopefully we will be able to arrange such interaction. In any case, thanks.

(In reply to comment #4)
> I think this bug highlights our ongoing need to put together some sort of
> distributed test harness...so that network failure situations (for both
> discovery and remote services) can be explicitly tested. The hard part about
> diagnosing and fixing these sorts of situations will be simply reproducing both
> the general network environment and the specific failure conditions that Bryan
> is seeing...and then finding and fixing any problems in remote services and/or
> discovery.

I totally agree that testing should be extended to the network layer. The problem is that, last time I checked, no (Java) framework existed that allows simulating the network layer (to write tests that exercise network outages, packet loss, ...). OTOH we need to think about the state space explosion. At some point formal verification might just be the only way.
Anyway, with the new hardware in place, we will hopefully find the resources (time and host-wise) to start with real remote testing across more than one host.

I've had the system described (with one additional remote service - 4 total) running on our production for about 24 hours. This system is in a production lab and I'm not aware of any network issues. At 2:53 AM zookeeper undiscovered and then immediately re-discovered one service. At 12:50 PM zookeeper undiscovered and then immediately re-discovered two services. I dumped javacore and the system has not created any additional ECF threads.

(In reply to comment #6)
> I've had the system described (with one additional remote service - 4 total)
> running on our production for about 24 hours. This system is in a production
> lab and I'm not aware of any network issues. At 2:53 AM zookeeper undiscovered
> and then immediately re-discovered one service. At 12:50 PM zookeeper
> undiscovered and then immediately re-discovered two services. I dumped
> javacore and the system has not created any additional ECF threads.

Thanks for the report...but I'm not quite clear...did the above sequence result in any obviously extraneous threads? Also...what version of zookeeper and ECF generic provider are you using?

One thing I've been thinking about...if zookeeper were to *not* undiscover a service (i.e. not report undiscover because of a zookeeper-server-to-client communication failure)...but the distribution provider didn't detect the failure...e.g. because the failure was transient and within the ECF generic keepalive timeout...and then send a new discovery notification, it could easily result in new client container creation and connection...and of course new threads on both client and server.

In any event...please keep collecting data as you can, and we'll do all we can to figure out what's going on specifically in your environment.

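The scenario described in the comment above (a repeated discovery notification with no undiscovery in between, leading to a second client container and its threads) can be illustrated with a small sketch. This is not ECF code: `EndpointConnectionRegistry`, its methods, and the use of a string endpoint ID as the key are all hypothetical names chosen for illustration. The idea is only that keying connections by endpoint ID makes a duplicate discovery a no-op instead of a new connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical registry (not part of ECF) that keeps at most one
 * connection per endpoint ID, so repeated discovery notifications
 * do not create additional connections or threads.
 */
final class EndpointConnectionRegistry<C extends AutoCloseable> {
    private final Map<String, C> connections = new ConcurrentHashMap<>();
    private final Function<String, C> connector;

    /** @param connector assumed factory that opens a connection for an endpoint ID */
    EndpointConnectionRegistry(Function<String, C> connector) {
        this.connector = connector;
    }

    /** Discovery notification: reuse the existing connection if one is already registered. */
    C discovered(String endpointId) {
        return connections.computeIfAbsent(endpointId, connector);
    }

    /** Undiscovery notification: close and forget the connection. Returns false if none was registered. */
    boolean undiscovered(String endpointId) {
        C connection = connections.remove(endpointId);
        if (connection == null) {
            return false;
        }
        try {
            connection.close();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to close connection for " + endpointId, e);
        }
        return true;
    }

    /** Number of endpoints that currently have an open connection. */
    int size() {
        return connections.size();
    }
}
```

One caveat with this design: if the existing connection is actually dead when the second discovery arrives, reusing it is the wrong answer, so a real implementation would presumably need a liveness check before deciding to reuse.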
(In reply to comment #7)
> Thanks for the report...but I'm not quite clear...did the above sequence result
> in any obviously extraneous threads? Also...what version of zookeeper and ECF
> generic provider are you using?

No extra threads were created. I was simply providing the observation that ECF appears to randomly undiscover and re-discover services under "normal conditions". I'm using the ECF 3.3 release with the following bundles checked out of HEAD from CVS: org.eclipse.ecf.osgi.services.discovery, org.eclipse.ecf.osgi.services.distribution, and org.eclipse.ecf.remoteservice, with patches from bug fixes applied to org.eclipse.ecf.osgi.services.distribution.

I believe this bug was due to a problem in zookeeper which has since been solved by a move to a more recent version of Apache ZooKeeper, and updates to the ECF zookeeper provider code. If discovery-based threads are being created and not cleaned up with ECF 3.5+, please reopen.
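For anyone checking whether threads are still accumulating before reopening, a minimal sketch of an in-process thread count is shown here as an alternative to comparing javacore dumps by hand. `ThreadCensus` is a hypothetical helper, and the thread-name prefix to look for is an assumption: the names ECF or ZooKeeper actually give their threads should be read off a real thread dump first.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of ECF) for spotting thread growth by name prefix. */
final class ThreadCensus {
    private ThreadCensus() {
    }

    /** Counts live threads whose name starts with the given prefix. */
    static long countByPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    /** Returns the sorted names of live threads whose name starts with the given prefix. */
    static List<String> namesByPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .filter(name -> name.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Logging `ThreadCensus.countByPrefix(...)` at intervals, or after each undiscover/re-discover pair, would show whether the count stays flat (as in the 24-hour run reported above) or grows with each rediscovery.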