Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 470394

Summary: [server] Fatal error starting authentication service. Server metastore is unavailable!
Product: [ECD] Orion Reporter: John Arthorne <john.arthorne>
Component: ServerAssignee: Anthony Hunter <ahunter.eclipse>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: P3 CC: ahunter.eclipse, Carolyn_MacLeod
Version: 9.0   
Target Milestone: 10.0   
Hardware: PC   
OS: Windows 7   
Whiteboard:

Description John Arthorne CLA 2015-06-17 11:12:02 EDT
We sometimes see an error occur on startup, which only seems to occur when server is under a heavy request load as it is starting up. In bug 466323 I added additional logging to narrow down the failure further, and we are seeing only this in the log when it occurs:

2015-06-16 17:43:12.292 [Component Resolve Thread (Bundle 118)] INFO  org.eclipse.orion.server.config - Server configuration file loaded from: /opt/IBM/orion/server/conf/orion.conf
2015-06-16 17:43:12.311 [Component Resolve Thread (Bundle 118)] INFO  org.eclipse.orion.server.config - Using MQTT message broker at ssl://xxxxx:yyyy
2015-06-16 17:43:12.412 [Component Resolve Thread (Bundle 118)] INFO  org.eclipse.orion.server.config - MQTT client connected.
2015-06-16 17:43:52.515 [Start Level: Equinox Container: 20e88aee-7814-0015-1806-f14ba63b564b] ERROR org.eclipse.orion.server.core - Fatal error starting authentication service. Server metastore is unavailable!
2015-06-16 17:46:26.850 [Component Resolve Thread] INFO  org.eclipse.orion.server.config - Loaded simple metadata store (version 8).
Comment 1 John Arthorne CLA 2015-06-17 11:13:49 EDT
When this occurs, at least one bundle (server auth) fails to start because the metastore location is null. This causes the server to be in a catastrophically broken state, and a server restart is required to fix it. Here is a stack of the failure:

Caused by: org.osgi.framework.BundleException: Exception in org.eclipse.orion.server.authentication.Activator.start() of bundle org.eclipse.orion.server.authentication.
	at org.eclipse.osgi.internal.framework.BundleContextImpl.startActivator(BundleContextImpl.java:792)
	at org.eclipse.osgi.internal.framework.BundleContextImpl.start(BundleContextImpl.java:721)
	at org.eclipse.osgi.internal.framework.EquinoxBundle.startWorker0(EquinoxBundle.java:936)
	at org.eclipse.osgi.internal.framework.EquinoxBundle$EquinoxModule.startWorker(EquinoxBundle.java:319)
	at org.eclipse.osgi.container.Module.doStart(Module.java:571)
	at org.eclipse.osgi.container.Module.start(Module.java:439)
	at org.eclipse.osgi.framework.util.SecureAction.start(SecureAction.java:454)
	at org.eclipse.osgi.internal.hooks.EclipseLazyStarter.postFindLocalClass(EclipseLazyStarter.java:107)
	... 48 more
Caused by: java.lang.RuntimeException: Fatal error starting authentication service. Server metastore is unavailable!
	at org.eclipse.orion.server.authentication.Activator.start(Activator.java:51)
	at org.eclipse.osgi.internal.framework.BundleContextImpl$3.run(BundleContextImpl.java:771)
	at org.eclipse.osgi.internal.framework.BundleContextImpl$3.run(BundleContextImpl.java:1)
	at java.security.AccessController.doPrivileged(AccessController.java:330)
	at org.eclipse.osgi.internal.framework.BundleContextImpl.startActivator(BundleContextImpl.java:764)
	... 55 more
Comment 2 John Arthorne CLA 2015-06-17 11:24:16 EDT
At first I thought this must be due to a failure accessing the backing storage device, resulting in a failure to initialize the metastore location. However there is now logging for those failures and I am seeing no such event logged when the failure occurs. There are two remaining possibilities that I can see:

1) server.core bundle has not yet started
2) server.core bundle has been stopped

From later logged events I can see server.core is active so I think 1) is the most likely situation. The bundle is lazy started so in theory it should not be possible to access Activator.getMetaStore() when the bundle has not been started. However the call chain is within another bundle activator so maybe we are seeing a timing window where server.activator is activated prior to server.core. 

We had a similar scenario in bug 407869 with a race condition between search and metastore initialization.

My theory is that we rarely see this because the race condition only occurs if an inbound client request causes servlets to be started before server.core initialization is complete. Both times we have seen this, the server was under heavy request load during startup (restarting a production server while it is taking user traffic).
Comment 3 Anthony Hunter CLA 2015-06-17 12:25:17 EDT
Funny (not) that this issue came up. When the orionhub.org VM server crashed last week, when I restarted orionhub.org it blew up as well. I had to restart a second time.

The log is at /home/admin/orionhub.org/current/eclipse-20150612-1125/log.txt

There errors start with:
!SESSION 2015-06-12 11:20:14.117 -----------------------------------------------
eclipse.buildId=9.0.0-v20150512-1027
java.version=1.7.0_71
java.vendor=Oracle Corporation
BootLoader constants: OS=linux, ARCH=x86_64, WS=gtk, NL=en_CA
Framework arguments:  -vm /home/admin/orionhub.org/current/current/jdk1.7.0_25/jre/bin/java
Command-line arguments:  -os linux -ws gtk -arch x86_64 -consoleLog -data /localdata/orionhub.org/serverworkspace -vm /home/admin/orionhub.org/current/current/jdk1.7.0_25/jre/bin/java

!ENTRY org.eclipse.equinox.ds 2 0 2015-06-12 11:20:45.503
!MESSAGE [SCR] Enabling components of bundle org.eclipse.orion.server.events did not complete in 30000 ms 

!ENTRY org.eclipse.equinox.ds 2 0 2015-06-12 11:20:45.529
!MESSAGE [SCR - WorkThread] Timeout occurred! Thread was blocked on processing [QueuedJob] WorkPerformer: org.eclipse.equinox.internal.ds.SCRManager@5218f9d7; actionType 1

!ENTRY org.eclipse.osgi 2 0 2015-06-12 11:20:50.580
!MESSAGE While loading class "org.eclipse.orion.server.core.OrionConfiguration", thread "Thread[Start Level: Equinox Container: 30d68786-1611-0015-1662-c962072a9f80,5,main]" timed out waiting (5001ms) for thread "Thread[Component Resolve Thread (Bundle 36),10,main]" to finish starting bundle "org.eclipse.orion.server.core_1.0.0.v20150504-1950 [69]". To avoid deadlock, thread "Thread[Start Level: Equinox Container: 30d68786-1611-0015-1662-c962072a9f80,5,main]" is proceeding but "org.eclipse.orion.server.core.OrionConfiguration" may not be fully initialized.
!STACK 0
org.osgi.framework.BundleException: Unable to acquire the state change lock for the module: osgi.identity; osgi.identity="org.eclipse.orion.server.core"; type="osgi.bundle"; version:Version="1.0.0.v20150504-1950"; singleton:="true" [id=69] STARTED [STARTED]
        at org.eclipse.osgi.container.Module.lockStateChange(Module.java:329)
        at org.eclipse.osgi.container.Module.start(Module.java:389)
Comment 4 Anthony Hunter CLA 2015-07-23 16:58:13 EDT
The error is caused by OrionConfiguration.getMetaStore() somehow returning null during startup. We are unable to reproduce this issue in development.

We do not want to cause Activator.start() to fail when this issue occurs.

The solution is to move the work performed in the start() method into a Job. If the null condition occurs, we can wait five seconds and try again.

The code change is in commit:
http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=d45ed98af302c5fdbeda482679918dc4bfff0fe9
Comment 5 Anthony Hunter CLA 2015-07-24 12:42:39 EDT
Can now see the log message we want:

[Worker-0] INFO  org.eclipse.orion.server.config - Started Orion server authentication successfully.