Build Identifier: org.eclipse.equinox.servletbridge_1.1.0.v20090520-1800

We recently encountered a deadlock in BridgeServlet. It appears that a background framework restart was in progress (holding the lock on FrameworkLauncher) while BridgeServlet.service attempted to get the framework class loader (which also requires the lock on FrameworkLauncher). The restart thread is waiting in unregisterServletDelegate for all accessors of BridgeServlet to finish, and the web container thread ends up waiting forever for the lock on FrameworkLauncher.

The Java dump (attached) includes these two relevant snippets:

3XMTHREADINFO      "Framework Event Dispatcher" TID:0x19865800, j9thread_t:0x1D165E04, state:CW, prio=5
3XMTHREADINFO1            (native thread ID:0x1E1C, native priority:0x5, native policy:UNKNOWN)
4XESTACKTRACE          at java/lang/Object.wait(Native Method)
4XESTACKTRACE          at java/lang/Object.wait(Object.java:167)
4XESTACKTRACE          at org/eclipse/equinox/servletbridge/BridgeServlet.unregisterServletDelegate(BridgeServlet.java:257)
4XESTACKTRACE          at org/eclipse/equinox/http/servletbridge/internal/Activator.stop(Activator.java:29)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/BundleContextImpl$2.run(BundleContextImpl.java:843)
4XESTACKTRACE          at java/security/AccessController.doPrivileged(AccessController.java:251(Compiled Code))
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/BundleContextImpl.stop(BundleContextImpl.java:836)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/BundleHost.stopWorker(BundleHost.java:474)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/AbstractBundle.suspend(AbstractBundle.java:546)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/Framework.suspendBundle(Framework.java:1098)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/StartLevelManager.decFWSL(StartLevelManager.java:593)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/StartLevelManager.doSetStartLevel(StartLevelManager.java:261)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/StartLevelManager.shutdown(StartLevelManager.java:216)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/InternalSystemBundle.suspend(InternalSystemBundle.java:266)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/Framework.shutdown(Framework.java:685)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/Framework.close(Framework.java:583)
4XESTACKTRACE          at org/eclipse/core/runtime/adaptor/EclipseStarter.shutdown(EclipseStarter.java:409)
4XESTACKTRACE          at sun/reflect/NativeMethodAccessorImpl.invoke0(Native Method)
4XESTACKTRACE          at sun/reflect/NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45(Compiled Code))
4XESTACKTRACE          at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))
4XESTACKTRACE          at java/lang/reflect/Method.invoke(Method.java:599(Compiled Code))
4XESTACKTRACE          at org/eclipse/equinox/servletbridge/FrameworkLauncher.stop(FrameworkLauncher.java:522)
4XESTACKTRACE          at org/eclipse/equinox/servletbridge/FrameworkLauncher$2.run(FrameworkLauncher.java:362)
4XESTACKTRACE          at org/eclipse/core/runtime/adaptor/EclipseStarter$1.bundleChanged(EclipseStarter.java:1504)
4XESTACKTRACE          at org/eclipse/osgi/framework/internal/core/BundleContextImpl.dispatchEvent(BundleContextImpl.java:919)
4XESTACKTRACE          at org/eclipse/osgi/framework/eventmgr/EventManager.dispatchEvent(EventManager.java:227)
4XESTACKTRACE          at org/eclipse/osgi/framework/eventmgr/EventManager$EventThread.run(EventManager.java:337)
3XMTHREADINFO "WebContainer : 0" TID:0x1ACCCE00, j9thread_t:0x1D6CA658, state:B, prio=5 3XMTHREADINFO1 (native thread ID:0x1A80, native priority:0x5, native policy:UNKNOWN) 4XESTACKTRACE at org/eclipse/equinox/servletbridge/BridgeServlet.service(BridgeServlet.java:119) 4XESTACKTRACE at javax/servlet/http/HttpServlet.service(HttpServlet.java:831) 4XESTACKTRACE at com/ibm/ws/webcontainer/servlet/ServletWrapper.service(ServletWrapper.java:1657) The apparent flow is: webcontainer:0 enters BridgeServlet.service and calls acquireDelegateReference(), which increments the reference count of servlets accessing the framework. webcontainer:0 loses context FrameworkEventDispatcher is invoked on detection of a change to a bundle (not sure why, need to dig more) FrameworkEventDispatcher calls FrameworkLauncher.stop, which is synchronized (takes lock on FrameworkLauncher) FrameworkEventDispatcher then drills down deep into the OSGi shutdown sequence - near the end it goes to unregister the servlet delegate on the BridgeServlet class. It enters unregisterServletDelegate (which takes a lock on BridgeServlet) and then takes a lock on instance, and then because webcontainer:0 is still accessing BridgeServlet.service (hasn't called releaseDelegateReference), it waits for a notification on the instance lock. webcontainer:0 now wakes up and calls FrameworkLauncher.getFrameworkClassLoader() which is synchronized, so it needs a lock on FrameworkLauncher. FrameworkEventDispatcher still owns this lock, so it waits. Deadlock. Proposed solution is to attempt to get FrameworkLauncher.getFrameworkClassLoader() first, then to acquireDelegateReference(). If that method returns null, it means the framework is shut down and so we return immediately. If the method returns a value, it means the framework is still running AND no one else can terminate until we complete because we incremented the reference count (the root of the problem). Appears to fix the problem at hand, however I don't know if there are other issues that could occur due to contention or memory boundaries. If this fix is insufficient, it would help to know where we fail so we could patch it locally. 
Current BridgeServlet:

112  ClassLoader original = Thread.currentThread().getContextClassLoader();
113  HttpServlet servletReference = acquireDelegateReference();
114  if (servletReference == null) {
115      resp.sendError(HttpServletResponse.SC_NOT_FOUND, "BridgeServlet: " + req.getRequestURI()); //$NON-NLS-1$
116      return;
117  }
118  try {
119      Thread.currentThread().setContextClassLoader(framework.getFrameworkContextClassLoader());
120      servletReference.service(req, resp);
121  } finally {
122      releaseDelegateReference();
123      Thread.currentThread().setContextClassLoader(original);
124  }

Proposed BridgeServlet:

112  ClassLoader original = Thread.currentThread().getContextClassLoader();
113  ClassLoader frameworkClassLoader = framework.getFrameworkContextClassLoader();
114  HttpServlet servletReference = acquireDelegateReference();
115  if (servletReference == null) {
116      resp.sendError(HttpServletResponse.SC_NOT_FOUND, "BridgeServlet: " + req.getRequestURI()); //$NON-NLS-1$
117      return;
118  }
119  try {
120      Thread.currentThread().setContextClassLoader(frameworkClassLoader);
121      servletReference.service(req, resp);
122  } finally {
123      releaseDelegateReference();
124      Thread.currentThread().setContextClassLoader(original);
125  }

Reproducible: Sometimes

Steps to Reproduce:
In an OSGi servletbridged web app, invoke FrameworkLauncher.stop() in a separate thread after a web container thread has acquired the delegate reference but before that thread accesses the framework class loader.

Actually triggering the framework stop or restart is more complex: in the stack trace you can see the bundleChanged event being fired by the framework, but I'm not sure what leads to that. That event triggers the forced restart of the framework from FrameworkLauncher, which calls stop() and sets this chain in motion. To recreate it, it should be sufficient to insert a fairly long sleep between the acquireDelegateReference() call and the getFrameworkContextClassLoader() call, and then invoke stop() via the bridge servlet's mechanism (see the standalone sketch below). It's a very small window, but we can apparently reproduce this fairly often in our environment (automated deployment that starts the web application, immediately begins running tests, and changes filesystem contents).
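To make the timing window concrete, here is a small standalone program that reproduces the same lock-ordering inversion. It is purely illustrative: the launcherLock/bridgeLock objects and the sleep are stand-ins for FrameworkLauncher, the BridgeServlet instance lock, and the window between acquireDelegateReference() and getFrameworkContextClassLoader(); none of this is Equinox code.

public class DeadlockWindowSketch {

    static final Object launcherLock = new Object(); // plays the FrameworkLauncher monitor
    static final Object bridgeLock = new Object();   // plays the BridgeServlet instance lock
    static int references = 0;

    public static void main(String[] args) {
        Thread webContainer = new Thread(() -> {
            // acquireDelegateReference(): take a reference under the bridge lock
            synchronized (bridgeLock) { references++; }
            sleep(1000); // artificially widen the window described above
            // getFrameworkContextClassLoader(): needs the launcher lock
            synchronized (launcherLock) {
                System.out.println("got framework class loader");
            }
            // releaseDelegateReference(): never reached once the deadlock forms
            synchronized (bridgeLock) { references--; bridgeLock.notifyAll(); }
        }, "WebContainer : 0");

        Thread eventDispatcher = new Thread(() -> {
            // FrameworkLauncher.stop() is synchronized: take and hold the launcher lock
            synchronized (launcherLock) {
                // unregisterServletDelegate(): wait for in-flight requests under the bridge lock
                synchronized (bridgeLock) {
                    while (references > 0) {
                        try { bridgeLock.wait(); } catch (InterruptedException e) { return; }
                    }
                }
            }
        }, "Framework Event Dispatcher");

        webContainer.start();
        sleep(200);              // let the web thread take its reference first
        eventDispatcher.start(); // both threads now block forever
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}

Run as-is, both threads hang: the web container stand-in blocks on the launcher lock, while the dispatcher stand-in waits on the bridge lock with the launcher lock still held, which is the same state shown in the javacore above.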
Created attachment 192771: Java core for the hung web container thread, generated by WebSphere
Thanks for the analysis, excellent stack trace and context. We've seen this problem reported once before about a year ago but were unable to do very much with it because of the difficulty in reproducing the problem. This is nasty stuff and your description really helps -- you are a star.
Closing due to inactivity.