Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 323208

Summary: Distribution Provider exceeds startup time limit (5 sec) if existing services are present
Product: [RT] ECF Reporter: Bryan Hunt <bhunt>
Component: ecf.remoteservicesAssignee: Markus Kuppe <bugs.eclipse.org>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: bugs.eclipse.org, slewis
Version: 3.3.0   
Target Milestone: 3.4.0   
Hardware: All   
OS: All   
Whiteboard:
Attachments:
Description Flags
testcase
none
Register existing services in a dedicated thread and not inside the OSGi startup thread
none
mylyn/context/zip none

Description Bryan Hunt CLA 2010-08-19 23:02:56 EDT
Created attachment 177060 [details]
testcase

I'm using Eclipse 3.6 with ECF 3.3

I was trying to reproduce the problem I encountered with multiple hosts on the same machine with zookeeper when I ran into an exception when using jmdns.  To reproduce, unzip the attached workspace, run the consumer, run the first host, then run the second host.  You should get the following exception:


!MESSAGE While loading class "org.eclipse.ecf.internal.osgi.services.distribution.DiscoveredServiceTrackerImpl$DiscoveredEndpointEvent", thread "Thread[JMDNS Discovery Thread,5,main]" timed out waiting (5000ms) for thread "Thread[Start Level Event Dispatcher,5,main]" to finish starting bundle "org.eclipse.ecf.osgi.services.distribution_1.2.0.v20100615-1451 [21]". To avoid deadlock, thread "Thread[JMDNS Discovery Thread,5,main]" is proceeding but "org.eclipse.ecf.internal.osgi.services.distribution.DiscoveredServiceTrackerImpl$DiscoveredEndpointEvent" may not be fully initialized.
!STACK 0
org.osgi.framework.BundleException: State change in progress for bundle "initial@reference:file:../../workspaces/osgi-remote-services/.metadata/.plugins/org.eclipse.pde.core/.bundle_pool/plugins/org.eclipse.ecf.osgi.services.distribution_1.2.0.v20100615-1451.jar/" by thread "Start Level Event Dispatcher".
	at org.eclipse.osgi.framework.internal.core.AbstractBundle.beginStateChange(AbstractBundle.java:1071)
	at org.eclipse.osgi.framework.internal.core.AbstractBundle.start(AbstractBundle.java:281)
	at org.eclipse.osgi.framework.util.SecureAction.start(SecureAction.java:440)
	at org.eclipse.osgi.internal.loader.BundleLoader.setLazyTrigger(BundleLoader.java:265)
	at org.eclipse.core.runtime.internal.adaptor.EclipseLazyStarter.postFindLocalClass(EclipseLazyStarter.java:106)
	at org.eclipse.osgi.baseadaptor.loader.ClasspathManager.findLocalClass(ClasspathManager.java:453)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.findLocalClass(DefaultClassLoader.java:216)
	at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:393)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:469)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
	at org.eclipse.ecf.internal.osgi.services.distribution.DiscoveredServiceTrackerImpl.serviceChanged(DiscoveredServiceTrackerImpl.java:151)
	at org.eclipse.ecf.internal.osgi.services.discovery.ServicePublicationHandler.notifyDiscoveredServiceTrackers(ServicePublicationHandler.java:103)
	at org.eclipse.ecf.internal.osgi.services.discovery.ServicePublicationHandler.serviceDiscovered(ServicePublicationHandler.java:73)
	at org.eclipse.ecf.discovery.AbstractDiscoveryContainerAdapter.fireServiceDiscovered(AbstractDiscoveryContainerAdapter.java:120)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer.fireDiscovered(JMDNSDiscoveryContainer.java:366)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer$2.run(JMDNSDiscoveryContainer.java:327)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer$1.run(JMDNSDiscoveryContainer.java:125)
	at java.lang.Thread.run(Thread.java:637)
Caused by: org.eclipse.osgi.framework.internal.core.AbstractBundle$BundleStatusException
	... 21 more
Root exception:
org.eclipse.osgi.framework.internal.core.AbstractBundle$BundleStatusException
	at org.eclipse.osgi.framework.internal.core.AbstractBundle.beginStateChange(AbstractBundle.java:1071)
	at org.eclipse.osgi.framework.internal.core.AbstractBundle.start(AbstractBundle.java:281)
	at org.eclipse.osgi.framework.util.SecureAction.start(SecureAction.java:440)
	at org.eclipse.osgi.internal.loader.BundleLoader.setLazyTrigger(BundleLoader.java:265)
	at org.eclipse.core.runtime.internal.adaptor.EclipseLazyStarter.postFindLocalClass(EclipseLazyStarter.java:106)
	at org.eclipse.osgi.baseadaptor.loader.ClasspathManager.findLocalClass(ClasspathManager.java:453)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.findLocalClass(DefaultClassLoader.java:216)
	at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:393)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:469)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
	at org.eclipse.ecf.internal.osgi.services.distribution.DiscoveredServiceTrackerImpl.serviceChanged(DiscoveredServiceTrackerImpl.java:151)
	at org.eclipse.ecf.internal.osgi.services.discovery.ServicePublicationHandler.notifyDiscoveredServiceTrackers(ServicePublicationHandler.java:103)
	at org.eclipse.ecf.internal.osgi.services.discovery.ServicePublicationHandler.serviceDiscovered(ServicePublicationHandler.java:73)
	at org.eclipse.ecf.discovery.AbstractDiscoveryContainerAdapter.fireServiceDiscovered(AbstractDiscoveryContainerAdapter.java:120)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer.fireDiscovered(JMDNSDiscoveryContainer.java:366)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer$2.run(JMDNSDiscoveryContainer.java:327)
	at org.eclipse.ecf.provider.jmdns.container.JMDNSDiscoveryContainer$1.run(JMDNSDiscoveryContainer.java:125)
	at java.lang.Thread.run(Thread.java:637)
Comment 1 Bryan Hunt CLA 2010-08-19 23:28:52 EDT
I just realized that I'm using a slightly newer version of eclipse / equinox than the 3.6 release.  You may have to get the M201000806 build to reproduce.
Comment 2 Scott Lewis CLA 2010-08-19 23:31:36 EDT
Which process shows this exception?  The consumer?

This is currently a bit of a mystery.  It *seems* like the root exception is a failed classload for a class accessed by this line in the stack trace

org.eclipse.ecf.internal.osgi.services.distribution.DiscoveredServiceTrackerImpl.serviceChanged(DiscoveredServiceTrackerImpl.java:151)

In my copy of DiscoveredServiceTrackerImpl this seems to be on the following line

try {
	// If the service endpoint description is not ECF's then we
	// don't process it
	adesc = getECFDescription(notification				.getServiceEndpointDescription());
>>> line 151 } catch (Exception e) {
	logError("serviceChanged.AVAILABLE", //$NON-NLS-1$
	"Error creating ECF endpoint description", e); //$NON-NLS-1$
	return;
}


Line 151 seems to be the catch (Exception e), which means that the classload is an exception class...which leads to two questions:  1) why is getECFDescription throwing an exception, and 2) what exception class is it trying and failing to load?  (and why is it reporting line 151)?

Would it be possible for you to debug into getECFDescription and attempt to determin why getECFDescription is throwing an exception?  Thanks.
Comment 3 Bryan Hunt CLA 2010-08-20 00:13:54 EDT
I switched my target platform to be based on the 3.6 release, and started over with the example code freshly extracted from CVS.  The exception is thrown on the host without modifying the example.  If I step through DiscoveredServiceTrackerImpl in the debugger, the exception is not thrown.  I'm guessing that since this exception is triggered by a 5 second timeout, pausing the code in the debugger is allowing the services.distribution bundle to finish starting.

Could the services.distribution bundle take more than 5 seconds to start?
Comment 4 Bryan Hunt CLA 2010-08-20 00:18:24 EDT
As an experiment, I changed the start level of the services.distribution bundle to 3 and the exception goes away.  When I put the start level back to default, I get the exception.
Comment 5 Scott Lewis CLA 2010-08-28 16:41:11 EDT
Adding Markus to cc for comment/visibility into this issue.
Comment 6 Markus Kuppe CLA 2010-09-09 07:56:23 EDT
Might this be another symptom of bug #254684 ?
Comment 7 Scott Lewis CLA 2010-09-09 08:50:54 EDT
(In reply to comment #6)
> Might this be another symptom of bug #254684 ?

Seems possible...but I don't really see much evidence of that...given bug 254684 has shown up as class circularity error/linkage error and this is not showing that.

In any event, perhaps Bryan will get a more recent version of ECF (with the fix to 254684) and attempt to reproduce this problem with the fix for 254684?  That would be appreciated.
Comment 8 Markus Kuppe CLA 2010-09-25 07:00:37 EDT
Created attachment 179566 [details]
Register existing services in a dedicated thread and not inside the OSGi startup thread
Comment 9 Markus Kuppe CLA 2010-09-25 07:03:35 EDT
As hinted by Bryan the root cause is that org.eclipse.ecf.internal.osgi.services.distribution.Activator.start(BundleContext) takes more than five seconds to start if existing remote services are started prior to the distribution provider. Registering those service registration in a separate thread solves the problem for me.

Bryan, please report back if the patch fixes your problem and does not create new ones. Thanks
Comment 10 Markus Kuppe CLA 2010-09-25 07:03:39 EDT
Created attachment 179567 [details]
mylyn/context/zip
Comment 11 Markus Kuppe CLA 2010-09-25 07:09:30 EDT
What appears to consume a considerable amount of time is org.eclipse.ecf.core.identity.GUID.GUID(Namespace, String, String, int) acquiring enough entropy for SecureRandom.nextBytes(byte[]). Do we need to use the secure version of random?
Comment 12 Bryan Hunt CLA 2010-09-25 09:02:21 EDT
Thanks Markus, I'll give this a try.  Is the patch against 3.3.0?  It may be some time before I can test it as I'm a couple of developers short of a full team.
Comment 13 Scott Lewis CLA 2010-09-25 10:28:50 EDT
(In reply to comment #11)
> What appears to consume a considerable amount of time is
> org.eclipse.ecf.core.identity.GUID.GUID(Namespace, String, String, int)
> acquiring enough entropy for SecureRandom.nextBytes(byte[]). Do we need to use
> the secure version of random?

Hmmm.  By default, we're using the sha1 impl in SecureRandom (i.e. the vm default).  This *is* needed to get a properly unique GUID created (i.e. with the level of randomness guaranteed by sha1 anyway).

One way to avoid/workaround would be to initialize SecureRandom (doesn't even have to be GUID) in some startup code...but putting that initialization code somewhere else in ECF would not really fix the problem from ECF consumers point of view.
Comment 14 Markus Kuppe CLA 2010-09-26 02:31:52 EDT
(In reply to comment #13)
> Hmmm.  By default, we're using the sha1 impl in SecureRandom (i.e. the vm
> default).  This *is* needed to get a properly unique GUID created (i.e. with
> the level of randomness guaranteed by sha1 anyway).

But do we need cryptographically secure randomness? How I read, we just have to avoid collisions?!

> One way to avoid/workaround would be to initialize SecureRandom (doesn't even
> have to be GUID) in some startup code...but putting that initialization code
> somewhere else in ECF would not really fix the problem from ECF consumers point
> of view.

I'm not certain it is the initialization. What I can think of is, that if the system's entropy cache is depleted, calling SecureRandom.nextBytes(byte[]) just blocks until enough "new" entropy has been gathered.
For servers it is usually hard to gather true entropy and thus the cache is regularly at the lower limit.

Anyway, by moving the registerService code into a separate thread, the behavior mentioned in this bug won't happen anymore.
Comment 15 Markus Kuppe CLA 2010-09-26 02:34:17 EDT
Releasing patch to HEAD