Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 403628

Summary: Asynchronous context shutdown causes timeouts during LifecycleManager#maybeCloseApplicationContextFor(Bundle) in Apache Felix
Product: [RT] Gemini.Blueprint Reporter: Olaf Otto <olaf>
Component: CoreAssignee: Olaf Otto <olaf>
Status: CLOSED FIXED QA Contact:
Severity: minor    
Priority: P3 CC: fwaibel, glyn.normington, tjwatson
Version: 1.0.2.RELEASE   
Target Milestone: 2.0.0.M02   
Hardware: PC   
OS: Windows 7   
Whiteboard:
Bug Depends on:    
Bug Blocks: 393961    
Attachments:
Description Flags
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable.
none
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable. Includes Documentation and LifecycleManager testcase. none

Description Olaf Otto CLA 2013-03-18 09:31:36 EDT
The execution of application context shutdowns in LifecycleManager#maybeCloseApplicationContextFor(...) should not be asynchronous.

Rationale:

1.) The current behavior is likely to introduce temporary deadlocks. The lifecycle method #maybeCloseApplicationContextFor is triggered in a synchronous bundle shutdown event, i.e. during a state-change in the underlying OSGi framework. The application context shutdown is also inevitably affecting framework state, e.g. through the unregistraion of the automatically exported blueprint OSGi services. Thus two threads are concurrently modifying framework state, which is likely to cause deadlocks. I have this very issue upon each bundle refresh in Felix 3.0.x, and it accumulates very badly if multiple blueprint bundles are installed and refreshed, leading to significant downtimes during deployment.

2.) IMO, an application context failing to stop is a programming issue gemini-blueprint is not responsible for. It must be safe to assume that shutting down an application context either executes timely or fails. Anything else can be considered a user mistake.


Example stacktrace showning the issue in Felix:


The gemini shutdown thread (this will time out) trying to aquire the lock held by the thread representing the synchronous event (see below):


"Gemini Blueprint context shutdown thread" daemon prio=6 tid=0x000000000b86c000 nid=0x1878 in Object.wait() [0x0000000010cce000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.felix.framework.Felix.acquireGlobalLock(Felix.java:4785)
	- locked <0x000000078509a858> (a [Ljava.lang.Object;)
	at org.apache.felix.framework.Felix.access$400(Felix.java:80)
	at org.apache.felix.framework.Felix$FelixResolver.resolve(Felix.java:4043)
	at org.apache.felix.framework.ModuleImpl.searchDynamicImports(ModuleImpl.java:1412)
	at org.apache.felix.framework.ModuleImpl.findClassOrResourceByDelegation(ModuleImpl.java:734)
	at org.apache.felix.framework.ModuleImpl.access$400(ModuleImpl.java:71)
	at org.apache.felix.framework.ModuleImpl$ModuleClassLoader.loadClass(ModuleImpl.java:1768)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at org.apache.felix.framework.ModuleImpl.getClassByDelegation(ModuleImpl.java:645)
	at org.apache.felix.framework.ServiceRegistrationImpl$ServiceReferenceImpl.isAssignableTo(ServiceRegistrationImpl.java:519)
	at org.apache.felix.framework.util.Util.isServiceAssignable(Util.java:275)
	at org.apache.felix.framework.util.EventDispatcher.invokeServiceListenerCallback(EventDispatcher.java:855)
	at org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:733)
	at org.apache.felix.framework.util.EventDispatcher.fireServiceEvent(EventDispatcher.java:662)
	at org.apache.felix.framework.Felix.fireServiceEvent(Felix.java:3735)
	at org.apache.felix.framework.Felix.access$000(Felix.java:80)
	at org.apache.felix.framework.Felix$2.serviceChanged(Felix.java:722)
	at org.apache.felix.framework.ServiceRegistry.unregisterService(ServiceRegistry.java:135)
	at org.apache.felix.framework.ServiceRegistrationImpl.unregister(ServiceRegistrationImpl.java:129)
	at org.eclipse.gemini.blueprint.util.OsgiServiceUtils.unregisterService(OsgiServiceUtils.java:37)
	at org.eclipse.gemini.blueprint.context.support.AbstractOsgiBundleApplicationContext.unpublishContextAsOsgiService(AbstractOsgiBundleApplicationContext.java:376)
	at org.eclipse.gemini.blueprint.context.support.AbstractOsgiBundleApplicationContext.doClose(AbstractOsgiBundleApplicationContext.java:185)
	at org.eclipse.gemini.blueprint.context.support.AbstractDelegatedExecutionApplicationContext.access$501(AbstractDelegatedExecutionApplicationContext.java:60)
	at org.eclipse.gemini.blueprint.context.support.AbstractDelegatedExecutionApplicationContext$2.run(AbstractDelegatedExecutionApplicationContext.java:194)
	at org.eclipse.gemini.blueprint.util.internal.PrivilegedUtils.executeWithCustomTCCL(PrivilegedUtils.java:85)
	at org.eclipse.gemini.blueprint.context.support.AbstractDelegatedExecutionApplicationContext.normalClose(AbstractDelegatedExecutionApplicationContext.java:190)
	at org.eclipse.gemini.blueprint.extender.internal.dependencies.startup.DependencyWaiterApplicationContextExecutor.close(DependencyWaiterApplicationContextExecutor.java:382)
	at org.eclipse.gemini.blueprint.context.support.AbstractDelegatedExecutionApplicationContext.doClose(AbstractDelegatedExecutionApplicationContext.java:214)
	at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1000)
	- locked <0x00000007ee6702a8> (a java.lang.Object)
	at org.eclipse.gemini.blueprint.extender.internal.activator.LifecycleManager.closeApplicationContext(LifecycleManager.java:310)
	at org.eclipse.gemini.blueprint.extender.internal.activator.LifecycleManager.access$200(LifecycleManager.java:49)
	at org.eclipse.gemini.blueprint.extender.internal.activator.LifecycleManager$2.run(LifecycleManager.java:285)
	at org.eclipse.gemini.blueprint.extender.internal.util.concurrent.RunnableTimedExecution$MonitoredRunnable.run(RunnableTimedExecution.java:55)
	at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
	at java.util.TimerThread.mainLoop(Timer.java:512)
	at java.util.TimerThread.run(Timer.java:462)




The original thread representing the synchronous shutdown event (this thread is holding the framework state lock)


"FelixPackageAdmin" daemon prio=6 tid=0x000000000a3f1000 nid=0x1a10 in Object.wait() [0x000000000b2af000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.eclipse.gemini.blueprint.extender.internal.util.concurrent.Counter.waitFor(Counter.java:160)
	- locked <0x00000007e4881eb0> (a org.eclipse.gemini.blueprint.extender.internal.util.concurrent.Counter)
	at org.eclipse.gemini.blueprint.extender.internal.util.concurrent.Counter.waitForZero(Counter.java:136)
	- locked <0x00000007e4881eb0> (a org.eclipse.gemini.blueprint.extender.internal.util.concurrent.Counter)
	at org.eclipse.gemini.blueprint.extender.internal.util.concurrent.RunnableTimedExecution.execute(RunnableTimedExecution.java:100)
	at org.eclipse.gemini.blueprint.extender.internal.activator.LifecycleManager.maybeCloseApplicationContextFor(LifecycleManager.java:280)
	at org.eclipse.gemini.blueprint.extender.internal.activator.ContextLoaderListener$ContextBundleListener.handleEvent(ContextLoaderListener.java:99)
	at org.eclipse.gemini.blueprint.extender.internal.activator.listeners.BaseListener.bundleChanged(BaseListener.java:87)
	at org.apache.felix.framework.util.EventDispatcher.invokeBundleListenerCallback(EventDispatcher.java:807)
	at org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:729)
	at org.apache.felix.framework.util.EventDispatcher.fireBundleEvent(EventDispatcher.java:610)
	at org.apache.felix.framework.Felix.fireBundleEvent(Felix.java:3724)
	at org.apache.felix.framework.Felix.stopBundle(Felix.java:2192)
	at org.apache.felix.framework.Felix$RefreshHelper.stop(Felix.java:4453)
	at org.apache.felix.framework.Felix.refreshPackages(Felix.java:3544)
	at org.apache.felix.framework.PackageAdminImpl.run(PackageAdminImpl.java:336)
	at java.lang.Thread.run(Thread.java:662)
Comment 1 Glyn Normington CLA 2013-03-18 10:31:18 EDT
I wonder if it's correct for the FelixPackageAdmin thread to hold the framework state lock while driving bundle listeners as this is an alien call.

Copying Tom in for his observations about Equinox framework shutdown for comparison.

(Note there is no such thing as a temporary deadlock, so the title should simply describe the symptoms, e.g. "extended delay".)
Comment 2 Olaf Otto CLA 2013-03-18 10:57:12 EDT
I have taken a look at the OSGi spec:

... "For BundleEvent types STARTED and LAZY_ACTIVATION, the Framework must not hold the referenced bundle's "state change" lock when the BundleEvent is delivered to a SynchronousBundleListener. For the other BundleEvent types, the Framework must hold the referenced bundle's "state change" lock when the BundleEvent is delivered to a SynchronousBundleListener. A SynchronousBundleListener cannot directly call life cycle methods on the referenced bundle when the Framework is holding the referenced bundle's "state change" lock. " [1] ...

However, there is no notion of a "framework state lock", i.e. the spec suggests the lock should be bundle-specific. However:

"The bundle lifecycle processing will not proceed until all SynchronousBundleListeners have completed. SynchronousBundleListener objects will be called prior to BundleListener objects. " [1]

In summary, this does not really object to a framework state lock being held while dispatching the synchronous events (after all, they are synchronous...), so launching asynchronous processes altering further framework state remains dangerous IMO.

[1] http://www.osgi.org/javadoc/r4v43/core/org/osgi/framework/SynchronousBundleListener.html
Comment 3 Glyn Normington CLA 2013-03-18 11:51:59 EDT
I agree: holding a framework state lock while driving synchronous bundle listeners  is probably consistent with the OSGi spec.

However, alien calls are deadlock prone and best avoided for that reason.
Comment 4 Thomas Watson CLA 2013-03-18 11:52:20 EDT
I will not claim the current Equinox (Kepler) release has a better grasp on thread-safety or proper control over holding locks while calling out to listeners than Felix. 

But a well behaved implementation should not hold global locks while calling out to the bundle listeners.  There are certain bundle events that require bundle state locks to be held (e.g. STARTING and STOPPING).  I know in Equinox we had to go to great lengths to avoid holding global locks as much as possible, but we are still far from perfect in this regard.

I currently am in the process of rewriting much of the core of Equinox to fix just such issues.  But this does require rethinking of the core container.

But for this particular issue it seems Felix has some strange behavior that attempts to resolve dynamic packages while in the middle of ServiceReference.isAssignableTo and this causes the global lock to be needed in order to attempt resolution.  Equinox certainly does not do this.  We only use the current wiring state to figure out isAssignableTo and do not consider any future possible dynamic package resolutions.
Comment 5 Glyn Normington CLA 2013-03-18 11:58:50 EDT
Thanks Tom. So I think it is fair to say that it should be a design objective of the framework not to be prone to deadlocking, which means avoiding alien calls whenever possible.

Switching Gemini Blueprint to drive application context shutdown synchronously would work around this particular issue, but would open up others. The design philosophy of middleware like Gemini Blueprint should, in general, to behave sensibly even when application code misbehaves. Of course there are limits to this, but that philosophy tends to scale up better since the larger the number of application bundles running in the system, the greater the chance that one of them will misbehave. Ideally, we want a bundle-specific misbehaviour to have a bundle-specific, or at least a "local" rather than a global, effect.

I suggest raising a bug against Felix to remove the alien call.
Comment 6 Olaf Otto CLA 2013-03-18 12:54:59 EDT
Thanks Gly & Tom for the great input!

Indeed, Felix is rather susceptible to deadlocks - mostly due to the "global" locking strategy of the framework state.

I have taken a look at the Felix issue tracker and there are a significant number of related issues ([1]). 

Specifically, https://issues.apache.org/jira/browse/FELIX-836 shows that the problem is known but not considered a general framework issue (the Felix SCR implementation apparently experienced the same problem). Their solution is to perform any service unregistration (or component deactivation in the SCR context) asynchronously [2].

For the fun of it, I have actually given this approach a try by introducing an asynchronous unregistration method in OsgiServiceUtils (using a single-threaded executor service to keep execution order). It works, but this is way to much of a drity hack.

Since I do not expect this behavior to be changed in Felix, it would make me rather happy if we could make the asynchronous context shutdown configurable (with asynchronous being the default). Would you consider this an option? I'd be happy to provide a patch.


[1] https://issues.apache.org/jira/issues/?jql=project%20%3D%20FELIX%20AND%20text%20~%20%22fireBundleEvent%20lock%22

[2] https://issues.apache.org/jira/browse/FELIX-836?focusedCommentId=12671137&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12671137
Comment 7 Olaf Otto CLA 2013-03-18 13:54:52 EDT
Created attachment 228584 [details]
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable.

I have created and successfully tested the before mentioned patch against the master.
Comment 8 Glyn Normington CLA 2013-03-19 06:32:41 EDT
(In reply to comment #7)
> Created attachment 228584 [details]
> Patch for the extenderConfiguration and LifecycleManager to make
> asynchronous context shutdown configurable.

Thanks. Please could you also attach a documentation patch otherwise the new option will be an undocumented external.

> 
> I have created and successfully tested the before mentioned patch against
> the master.

Please note that before releasing, the integration tests must pass with Equinox, Felix, and Knopflerfish. I'm not sure which you tested against. If you are interested in testing with all of them, please see step 2.1 of this:

http://wiki.eclipse.org/Gemini/Blueprint/Building

However, I can't see that this change will behave any differently on different frameworks given that the default behaviour is intended to be the same as today.
Comment 9 Olaf Otto CLA 2013-03-19 11:09:42 EDT
Created attachment 228646 [details]
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable. Includes Documentation and LifecycleManager testcase.

I have updated the documentation regarding extenderProperties and successfully executed the integration tests for felix, knoplerfish and equinox.
Furthermore, I have added a new testcase for the LifecycleManager covering synchronous and asynchrounous execution (as far as that is possible with easymock 1.2 ...).

While it is not possible to prevent blocking in a synchronous shutdown, I have now added a try/catch block around the synchronous execution to prevent exceptions during shutdown to disrupt the lifecycle, so at least that bit of gracefulness is preserved.

As always, I hereby confirm that I have written 100% of the new code myself and do have the rights to contribute it to this project.
Comment 10 Glyn Normington CLA 2013-03-19 11:52:47 EDT
Thanks for the contribution Olaf! Committed to master as e552de0a5cf17cabdc2db9c8f9470d27dafa96c4. Note that I corrected the prologue of the new file: you are the copyright owner and initial contributor. All the tests pass for me too on all three frameworks.
Comment 11 Olaf Otto CLA 2013-10-18 05:18:31 EDT
There is one more situation where the above configuration must be considered:

LifecycleManager#destroy also shuts down each context asynchronously (while in a serial fashion). This exhibits the same locking issues as for the individual context shutdowns. I will thus apply the same behavior there:

In case of extenderConfiguration.shouldShutdownAsynchronously() == false, the context shutdowns are down synchronously.
Comment 12 Olaf Otto CLA 2013-10-18 09:46:31 EDT
Fixed in b1bf99eb05e2924639415ca2a07b1a0ec47acf0d
Comment 13 Florian Waibel CLA 2015-12-21 09:48:21 EST
Milestone 2.0.0.M02 has been delivered.