Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 482169

Summary: Orion server blocks requests during startup
Product: [ECD] Orion Reporter: John Arthorne <john.arthorne>
Component: ServerAssignee: John Arthorne <john.arthorne>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3    
Version: unspecified   
Target Milestone: 11.0   
Hardware: PC   
OS: Windows 7   
Whiteboard:

Description John Arthorne CLA 2015-11-13 16:13:38 EST
In a very large Orion deployment, the Java web server can be blocked from taking requests while the server is "warming up". The sequence is as follows:

1) All threads taking user requests get blocked on FilterWrapper#initializeDelegate:

Blocked on: org/eclipse/equinox/http/registry/internal/FilterManager$FilterWrapper@0x0000000600F86F98 Owned by: "Default Executor-thread-10" (J9VMThread:0x00000000031D1E00, java/lang/Thread:0x0000000600FA9990)
Heap bytes allocated since last GC cycle=0 (0x0)
Java callstack:
at org/eclipse/equinox/http/registry/internal/FilterManager$FilterWrapper.doFilter(FilterManager.java:172)
at org/eclipse/equinox/http/servlet/internal/FilterRegistration.doFilter(FilterRegistration.java:81)
at org/eclipse/equinox/http/servlet/internal/FilterChainImpl.doFilter(FilterChainImpl.java:35)
at org/eclipse/equinox/http/servlet/internal/ProxyServlet.processAlias(ProxyServlet.java:132)
at org/eclipse/equinox/http/servlet/internal/ProxyServlet.service(ProxyServlet.java:68)

2) One thread owns the FilterWrapper lock, and is blocked on class loading:

Parked on: java/util/concurrent/locks/ReentrantLock$NonfairSync@0x0000000600FA7450 Owned by: "Component Resolve Thread" (J9VMThread:0x0000000002FBAE00, java/lang/Thread:0x0000000600FA7C48)
Heap bytes allocated since last GC cycle=65536 (0x10000)
Java callstack:
at sun/misc/Unsafe.park(Native Method)
at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:238(Compiled Code))
at java/util/concurrent/locks/AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:941)
at java/util/concurrent/locks/AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1257)
at java/util/concurrent/locks/ReentrantLock.tryLock(ReentrantLock.java:457)
at org/eclipse/osgi/container/Module.lockStateChange(Module.java:299)
at org/eclipse/osgi/container/Module.start(Module.java:389)
at org/eclipse/osgi/framework/util/SecureAction.start(SecureAction.java:454)
at org/eclipse/osgi/internal/hooks/EclipseLazyStarter.postFindLocalClass(EclipseLazyStarter.java:107)
at org/eclipse/osgi/internal/loader/classpath/ClasspathManager.findLocalClass(ClasspathManager.java:531)
at org/eclipse/osgi/internal/loader/ModuleClassLoader.findLocalClass(ModuleClassLoader.java:324)
at org/eclipse/osgi/internal/loader/BundleLoader.findLocalClass(BundleLoader.java:311)
at org/eclipse/osgi/internal/loader/sources/SingleSourcePackage.loadClass(SingleSourcePackage.java:36)
at org/eclipse/osgi/internal/loader/BundleLoader.findClassInternal(BundleLoader.java:382)
at org/eclipse/osgi/internal/loader/BundleLoader.findClass(BundleLoader.java:336(Compiled Code))
at org/eclipse/osgi/internal/loader/BundleLoader.findClass(BundleLoader.java:328(Compiled Code))
at org/eclipse/osgi/internal/loader/ModuleClassLoader.loadClass(ModuleClassLoader.java:160(Compiled Code))
at java/lang/ClassLoader.loadClass(ClassLoader.java:741(Compiled Code))
at org/eclipse/osgi/internal/framework/EquinoxBundle.loadClass(EquinoxBundle.java:568)
at org/eclipse/core/internal/registry/osgi/RegistryStrategyOSGI.createExecutableExtension(RegistryStrategyOSGI.java:174)
at org/eclipse/core/internal/registry/ExtensionRegistry.createExecutableExtension(ExtensionRegistry.java:905)
at org/eclipse/core/internal/registry/ConfigurationElement.createExecutableExtension(ConfigurationElement.java:243)
at org/eclipse/core/internal/registry/ConfigurationElementHandle.createExecutableExtension(ConfigurationElementHandle.java:55)
at org/eclipse/equinox/http/registry/internal/FilterManager$FilterWrapper.initializeDelegate(FilterManager.java:183)

3) The class loader is blocked by the Orion server core activator, which is reading all user metadata during start. This is an expensive operation that can take minutes to run for a large Orion install with 100K users:

at org/eclipse/orion/internal/server/core/metastore/SimpleMetaStore.readAllUsers(SimpleMetaStore.java:588)
at org/eclipse/orion/internal/server/core/Activator.initializeMetaStore(Activator.java:153)
at org/eclipse/orion/internal/server/core/Activator.start(Activator.java:168)
Comment 1 John Arthorne CLA 2015-11-13 16:17:25 EST
One solution would be to defer initializing the user property cache until needed, or do it asynchronously outside of the bundle startup. However this cache is needed to lookup a user by id or email, which is needed by authentication, etc. So the server is not entirely useful until this initialization completes. 

Another solution is to expose an endpoint which triggers any kind of server "warmup" such as populating the user cache. This allows a load balancer or deployment automation system to avoid the server taking traffic before it is fully prepared to take requests. Monitoring of Orion server liveness is often done by doing a "wget" on the server about page, so that endpoint could be used to trigger warmup operations such as the user cache.

Yet another solution is to store the user cache in some persistent store, to avoid the cost of computing it on startup. This adds complexity to the server metadata and makes user management more complex (today you can remove a user simply by removing or unlinking their workspace directory).
Comment 2 John Arthorne CLA 2015-11-13 16:32:52 EST
It's not the most elegant solution, but I have released this simple fix that allows us to validate the problem in a large scale Orion install:

http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=fe1ab55ba63976fc0245780caa4c334c4eef67bf
Comment 3 John Arthorne CLA 2015-11-14 09:39:12 EST
I have also added logging of the two most expensive init steps so we can confirm how long they are taking:

http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=13c308d2b5286cf4952a2715948a55c30629e5c3
Comment 4 John Arthorne CLA 2015-11-17 21:20:20 EST
Confirmed that this fix works as expected for a large Orion install.