| Summary: | Asynchronous context shutdown causes timeouts during LifecycleManager#maybeCloseApplicationContextFor(Bundle) in Apache Felix | ||
|---|---|---|---|
| Product: | [RT] Gemini.Blueprint | Reporter: | Olaf Otto <olaf> |
| Component: | Core | Assignee: | Olaf Otto <olaf> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | minor | ||
| Priority: | P3 | CC: | fwaibel, glyn.normington, tjwatson |
| Version: | 1.0.2.RELEASE | ||
| Target Milestone: | 2.0.0.M02 | ||
| Hardware: | PC | ||
| OS: | Windows 7 | ||
| Whiteboard: | |||
| Bug Depends on: | |||
| Bug Blocks: | 393961 | ||
| Attachments: | |||
|
Description
Olaf Otto
I wonder if it's correct for the FelixPackageAdmin thread to hold the framework state lock while driving bundle listeners as this is an alien call. Copying Tom in for his observations about Equinox framework shutdown for comparison. (Note there is no such thing as a temporary deadlock, so the title should simply describe the symptoms, e.g. "extended delay".) I have taken a look at the OSGi spec: ... "For BundleEvent types STARTED and LAZY_ACTIVATION, the Framework must not hold the referenced bundle's "state change" lock when the BundleEvent is delivered to a SynchronousBundleListener. For the other BundleEvent types, the Framework must hold the referenced bundle's "state change" lock when the BundleEvent is delivered to a SynchronousBundleListener. A SynchronousBundleListener cannot directly call life cycle methods on the referenced bundle when the Framework is holding the referenced bundle's "state change" lock. " [1] ... However, there is no notion of a "framework state lock", i.e. the spec suggests the lock should be bundle-specific. However: "The bundle lifecycle processing will not proceed until all SynchronousBundleListeners have completed. SynchronousBundleListener objects will be called prior to BundleListener objects. " [1] In summary, this does not really object to a framework state lock being held while dispatching the synchronous events (after all, they are synchronous...), so launching asynchronous processes altering further framework state remains dangerous IMO. [1] http://www.osgi.org/javadoc/r4v43/core/org/osgi/framework/SynchronousBundleListener.html I agree: holding a framework state lock while driving synchronous bundle listeners is probably consistent with the OSGi spec. However, alien calls are deadlock prone and best avoided for that reason. I will not claim the current Equinox (Kepler) release has a better grasp on thread-safety or proper control over holding locks while calling out to listeners than Felix. But a well behaved implementation should not hold global locks while calling out to the bundle listeners. There are certain bundle events that require bundle state locks to be held (e.g. STARTING and STOPPING). I know in Equinox we had to go to great lengths to avoid holding global locks as much as possible, but we are still far from perfect in this regard. I currently am in the process of rewriting much of the core of Equinox to fix just such issues. But this does require rethinking of the core container. But for this particular issue it seems Felix has some strange behavior that attempts to resolve dynamic packages while in the middle of ServiceReference.isAssignableTo and this causes the global lock to be needed in order to attempt resolution. Equinox certainly does not do this. We only use the current wiring state to figure out isAssignableTo and do not consider any future possible dynamic package resolutions. Thanks Tom. So I think it is fair to say that it should be a design objective of the framework not to be prone to deadlocking, which means avoiding alien calls whenever possible. Switching Gemini Blueprint to drive application context shutdown synchronously would work around this particular issue, but would open up others. The design philosophy of middleware like Gemini Blueprint should, in general, to behave sensibly even when application code misbehaves. Of course there are limits to this, but that philosophy tends to scale up better since the larger the number of application bundles running in the system, the greater the chance that one of them will misbehave. Ideally, we want a bundle-specific misbehaviour to have a bundle-specific, or at least a "local" rather than a global, effect. I suggest raising a bug against Felix to remove the alien call. Thanks Gly & Tom for the great input! Indeed, Felix is rather susceptible to deadlocks - mostly due to the "global" locking strategy of the framework state. I have taken a look at the Felix issue tracker and there are a significant number of related issues ([1]). Specifically, https://issues.apache.org/jira/browse/FELIX-836 shows that the problem is known but not considered a general framework issue (the Felix SCR implementation apparently experienced the same problem). Their solution is to perform any service unregistration (or component deactivation in the SCR context) asynchronously [2]. For the fun of it, I have actually given this approach a try by introducing an asynchronous unregistration method in OsgiServiceUtils (using a single-threaded executor service to keep execution order). It works, but this is way to much of a drity hack. Since I do not expect this behavior to be changed in Felix, it would make me rather happy if we could make the asynchronous context shutdown configurable (with asynchronous being the default). Would you consider this an option? I'd be happy to provide a patch. [1] https://issues.apache.org/jira/issues/?jql=project%20%3D%20FELIX%20AND%20text%20~%20%22fireBundleEvent%20lock%22 [2] https://issues.apache.org/jira/browse/FELIX-836?focusedCommentId=12671137&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12671137 Created attachment 228584 [details]
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable.
I have created and successfully tested the before mentioned patch against the master.
(In reply to comment #7) > Created attachment 228584 [details] > Patch for the extenderConfiguration and LifecycleManager to make > asynchronous context shutdown configurable. Thanks. Please could you also attach a documentation patch otherwise the new option will be an undocumented external. > > I have created and successfully tested the before mentioned patch against > the master. Please note that before releasing, the integration tests must pass with Equinox, Felix, and Knopflerfish. I'm not sure which you tested against. If you are interested in testing with all of them, please see step 2.1 of this: http://wiki.eclipse.org/Gemini/Blueprint/Building However, I can't see that this change will behave any differently on different frameworks given that the default behaviour is intended to be the same as today. Created attachment 228646 [details]
Patch for the extenderConfiguration and LifecycleManager to make asynchronous context shutdown configurable. Includes Documentation and LifecycleManager testcase.
I have updated the documentation regarding extenderProperties and successfully executed the integration tests for felix, knoplerfish and equinox.
Furthermore, I have added a new testcase for the LifecycleManager covering synchronous and asynchrounous execution (as far as that is possible with easymock 1.2 ...).
While it is not possible to prevent blocking in a synchronous shutdown, I have now added a try/catch block around the synchronous execution to prevent exceptions during shutdown to disrupt the lifecycle, so at least that bit of gracefulness is preserved.
As always, I hereby confirm that I have written 100% of the new code myself and do have the rights to contribute it to this project.
Thanks for the contribution Olaf! Committed to master as e552de0a5cf17cabdc2db9c8f9470d27dafa96c4. Note that I corrected the prologue of the new file: you are the copyright owner and initial contributor. All the tests pass for me too on all three frameworks. There is one more situation where the above configuration must be considered: LifecycleManager#destroy also shuts down each context asynchronously (while in a serial fashion). This exhibits the same locking issues as for the individual context shutdowns. I will thus apply the same behavior there: In case of extenderConfiguration.shouldShutdownAsynchronously() == false, the context shutdowns are down synchronously. Fixed in b1bf99eb05e2924639415ca2a07b1a0ec47acf0d Milestone 2.0.0.M02 has been delivered. |