Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 353835 - [zoodiscovery] hang in WatchManager.publish
Summary: [zoodiscovery] hang in WatchManager.publish
Status: RESOLVED FIXED
Alias: None
Product: ECF
Classification: RT
Component: ecf.discovery (show other bugs)
Version: 3.5.0   Edit
Hardware: PC Windows XP
: P3 critical (vote)
Target Milestone: 3.5.2   Edit
Assignee: Scott Lewis CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-08-03 20:26 EDT by Scott Lewis CLA
Modified: 2011-09-05 12:31 EDT (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Scott Lewis CLA 2011-08-03 20:26:18 EDT
With server usage of the zoodiscovery provider, I'm seen regular thread hangs at this point in the stack:

Thread [pool-3-thread-2] (Suspended)    
   Object.wait(long) line: not available [native method]    
   WatchManager$Lock(Object).wait() line: 485    
   WatchManager.publish(AdvertisedService) line: 88    
   ZooDiscoveryContainer.registerService(IServiceInfo) line: 400    
   EndpointDescriptionAdvertiser.doDiscovery(IDiscoveryAdvertiser, IServiceInfo, boolean) line: 45    
   EndpointDescriptionAdvertiser.doDiscovery(EndpointDescription, boolean) line: 109    
   EndpointDescriptionAdvertiser.advertise(EndpointDescription) line: 38    
   BasicTopologyManager(AbstractTopologyManager).advertiseEndpointDescription(EndpointDescription) line: 159    
   BasicTopologyManager(AbstractTopologyManager).advertiseEndpointDescriptions(List<EndpointDescription>) line: 338    
   BasicTopologyManager(AbstractTopologyManager).handleServiceRegistering(ServiceReference) line: 332    
   BasicTopologyManager(AbstractTopologyManager).handleEvent(ServiceEvent, Collection) line: 257    
...

The problem appears to be that the publish method is racing against any calls to WatchManager.watch()...which has the notifyAll() for the associated WatchManager$Lock(Object).wait() in publish().  

In the cases where I've seen this most frequently, there are *no* system properties set for zookeeper provider, meaning that the defaults are what is used.
Comment 1 Scott Lewis CLA 2011-08-03 20:29:02 EDT
Setting target milestone, and adding people to cc list.
Comment 2 Scott Lewis CLA 2011-08-09 00:41:37 EDT
Moving to critical, as I think this has a number of far-reaching effects for users of zookeeper discovery...especially in server environments.
Comment 3 Ahmed Aadel CLA 2011-08-24 04:22:03 EDT
As I cannot reproduce your case, would you please test the following and see if it fixes the problem. Please swap method:  WatchManager.publish()  with: 

public void publish(AdvertisedService published) {
		Assert.isNotNull(published);
		String serviceid = published.getServiceID().getServiceTypeID().getInternal();
		if (getNodeWriters().containsKey(serviceid))
			return;
		try {
			/* wait for the server to get ready */
			while (!writeRootLock.isOpen())
				Thread.sleep(300);
		} catch (InterruptedException e) {
			Logger.log(LogService.LOG_DEBUG, e.getMessage(), e);
		}
		NodeWriter nodeWriter = new NodeWriter(published, writeRoot);
		getNodeWriters().put(serviceid, nodeWriter);
		allKnownServices.put(published.getServiceID().getName(), published);
		nodeWriter.publish();
	}
Comment 4 Scott Lewis CLA 2011-08-24 12:42:22 EDT
(In reply to comment #3)

Thanks for the patch Ahmed.

I've applied the patch and done some very basic regression testing and so far I haven't been able to reproduce the hang.  That's good, of course...but I have a clarifying question:  How is it guaranteed that !writeRootLock.isOpen() eventually will fail (and sleeping will stop)?  Is it via successful completion of the watch() method?

Also...there are two publish methods (i.e. AdvertisedService and ServiceReference).  Should the ServiceReference one be modified as well?  If so, please provide another code fragment as before.

Proposed fix pushed to master:

http://git.eclipse.org/c/ecf/org.eclipse.ecf.git/commit/?id=af351781639f9798e185a329186970771ee24680

I'll leave bug open for continued/additional testing over next few days.  Please report any testing here.
Comment 5 Scott Lewis CLA 2011-08-28 23:53:50 EDT
In my tests so far, the issue has not shown itself.  We'll hope it's gone and not to return.  I'll resolve this bug and reopen if necessary.
Comment 6 Scott Lewis CLA 2011-08-28 23:54:19 EDT
Changing target milestone to 3.6
Comment 7 Wim Jongman CLA 2011-09-05 06:30:44 EDT
(In reply to comment #6)
> Changing target milestone to 3.6

Why not 3.5.2?
Comment 8 Scott Lewis CLA 2011-09-05 12:19:26 EDT
This bug is resolved (for 3.5.2), so my changing to 3.6 target was a mistake.
Comment 9 Scott Lewis CLA 2011-09-05 12:31:46 EDT
Changing back to 3.5.2 as per comment 8