With server usage of the zoodiscovery provider, I'm seeing regular thread hangs at this point in the stack:

Thread [pool-3-thread-2] (Suspended)
    Object.wait(long) line: not available [native method]
    WatchManager$Lock(Object).wait() line: 485
    WatchManager.publish(AdvertisedService) line: 88
    ZooDiscoveryContainer.registerService(IServiceInfo) line: 400
    EndpointDescriptionAdvertiser.doDiscovery(IDiscoveryAdvertiser, IServiceInfo, boolean) line: 45
    EndpointDescriptionAdvertiser.doDiscovery(EndpointDescription, boolean) line: 109
    EndpointDescriptionAdvertiser.advertise(EndpointDescription) line: 38
    BasicTopologyManager(AbstractTopologyManager).advertiseEndpointDescription(EndpointDescription) line: 159
    BasicTopologyManager(AbstractTopologyManager).advertiseEndpointDescriptions(List<EndpointDescription>) line: 338
    BasicTopologyManager(AbstractTopologyManager).handleServiceRegistering(ServiceReference) line: 332
    BasicTopologyManager(AbstractTopologyManager).handleEvent(ServiceEvent, Collection) line: 257
    ...

The problem appears to be that the publish method is racing against any calls to WatchManager.watch(), which contains the notifyAll() for the associated WatchManager$Lock(Object).wait() in publish(). In the cases where I've seen this most frequently, there are *no* system properties set for the zookeeper provider, meaning that the defaults are used.
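To illustrate the kind of race described above: if notifyAll() runs before the other thread reaches wait(), the notification is lost and the waiter blocks indefinitely. A minimal sketch of the usual guard, assuming a hypothetical Gate class (this is not the ECF WatchManager$Lock code), records the state in a flag and re-checks it in a loop, so a late waiter returns immediately:

```java
// Hypothetical illustration only; the name Gate and its methods are not part of ECF.
class Gate {
    private boolean open = false;

    // Called by the thread that makes the resource ready.
    public synchronized void open() {
        open = true;   // remember the state so late waiters still see it
        notifyAll();   // wake any thread already waiting
    }

    public synchronized boolean isOpen() {
        return open;
    }

    // Waits until the gate is open or the timeout elapses.
    // Returns true if the gate is open, false on timeout or interruption.
    public synchronized boolean awaitOpen(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (!open) {   // the loop also guards against spurious wakeups
                long remaining = (deadline - System.nanoTime()) / 1_000_000L;
                if (remaining <= 0) {
                    return false;   // never call wait(0), which waits forever
                }
                wait(remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            return false;
        }
    }
}
```

With this shape, the order in which open() and awaitOpen() are called no longer matters, because the waiter checks the flag rather than relying on catching the notification.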
Setting target milestone, and adding people to cc list.
Moving to critical, as I think this has a number of far-reaching effects for users of zookeeper discovery...especially in server environments.
As I cannot reproduce your case, would you please test the following and see if it fixes the problem? Please replace the method WatchManager.publish() with:

    public void publish(AdvertisedService published) {
        Assert.isNotNull(published);
        String serviceid = published.getServiceID().getServiceTypeID().getInternal();
        if (getNodeWriters().containsKey(serviceid))
            return;
        try {
            /* wait for the server to get ready */
            while (!writeRootLock.isOpen())
                Thread.sleep(300);
        } catch (InterruptedException e) {
            Logger.log(LogService.LOG_DEBUG, e.getMessage(), e);
        }
        NodeWriter nodeWriter = new NodeWriter(published, writeRoot);
        getNodeWriters().put(serviceid, nodeWriter);
        allKnownServices.put(published.getServiceID().getName(), published);
        nodeWriter.publish();
    }
(In reply to comment #3)

Thanks for the patch, Ahmed. I've applied the patch and done some very basic regression testing, and so far I haven't been able to reproduce the hang. That's good, of course, but I have a clarifying question: how is it guaranteed that !writeRootLock.isOpen() will eventually become false (so that the sleeping stops)? Is it via successful completion of the watch() method?

Also, there are two publish methods (i.e. one taking AdvertisedService and one taking ServiceReference). Should the ServiceReference one be modified as well? If so, please provide another code fragment as before.

Proposed fix pushed to master: http://git.eclipse.org/c/ecf/org.eclipse.ecf.git/commit/?id=af351781639f9798e185a329186970771ee24680

I'll leave the bug open for continued/additional testing over the next few days. Please report any testing here.
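On the question of whether the wait is guaranteed to end: one common way to make that explicit is to bound the wait with a timeout. The following is only a sketch under assumptions, using the standard java.util.concurrent.CountDownLatch rather than anything from the ECF code base; the ReadySignal class and its method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the ECF writeRootLock implementation.
class ReadySignal {
    private final CountDownLatch ready = new CountDownLatch(1);

    // Called once the server side is ready (e.g. at the end of a setup step).
    void markReady() {
        ready.countDown();
    }

    // Blocks until markReady() has been called or the timeout elapses.
    // Returns true if ready, false on timeout or interruption.
    boolean awaitReady(long timeout, TimeUnit unit) {
        try {
            return ready.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt status
            return false;
        }
    }
}
```

A caller can then decide what to do when awaitReady returns false (log, retry, or fail the registration) instead of sleeping in a loop with no upper bound.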
In my tests so far, the issue has not shown itself. Let's hope it's gone and won't return. I'll resolve this bug and reopen if necessary.
Changing target milestone to 3.6
(In reply to comment #6)
> Changing target milestone to 3.6

Why not 3.5.2?
This bug is resolved (for 3.5.2), so my changing to 3.6 target was a mistake.
Changing back to 3.5.2 as per comment 8