| Summary: | [server] Fatal error starting authentication service. Server metastore is unavailable! | ||
|---|---|---|---|
| Product: | [ECD] Orion | Reporter: | John Arthorne <john.arthorne> |
| Component: | Server | Assignee: | Anthony Hunter <ahunter.eclipse> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | major | ||
| Priority: | P3 | CC: | ahunter.eclipse, Carolyn_MacLeod |
| Version: | 9.0 | ||
| Target Milestone: | 10.0 | ||
| Hardware: | PC | ||
| OS: | Windows 7 | ||
| Whiteboard: | |||
|
Description
John Arthorne
When this occurs, at least one bundle (server auth) fails to start because the metastore location is null. This causes the server to be in a catastrophically broken state, and a server restart is required to fix it. Here is a stack of the failure: Caused by: org.osgi.framework.BundleException: Exception in org.eclipse.orion.server.authentication.Activator.start() of bundle org.eclipse.orion.server.authentication. at org.eclipse.osgi.internal.framework.BundleContextImpl.startActivator(BundleContextImpl.java:792) at org.eclipse.osgi.internal.framework.BundleContextImpl.start(BundleContextImpl.java:721) at org.eclipse.osgi.internal.framework.EquinoxBundle.startWorker0(EquinoxBundle.java:936) at org.eclipse.osgi.internal.framework.EquinoxBundle$EquinoxModule.startWorker(EquinoxBundle.java:319) at org.eclipse.osgi.container.Module.doStart(Module.java:571) at org.eclipse.osgi.container.Module.start(Module.java:439) at org.eclipse.osgi.framework.util.SecureAction.start(SecureAction.java:454) at org.eclipse.osgi.internal.hooks.EclipseLazyStarter.postFindLocalClass(EclipseLazyStarter.java:107) ... 48 more Caused by: java.lang.RuntimeException: Fatal error starting authentication service. Server metastore is unavailable! at org.eclipse.orion.server.authentication.Activator.start(Activator.java:51) at org.eclipse.osgi.internal.framework.BundleContextImpl$3.run(BundleContextImpl.java:771) at org.eclipse.osgi.internal.framework.BundleContextImpl$3.run(BundleContextImpl.java:1) at java.security.AccessController.doPrivileged(AccessController.java:330) at org.eclipse.osgi.internal.framework.BundleContextImpl.startActivator(BundleContextImpl.java:764) ... 55 more At first I thought this must be due to a failure accessing the backing storage device, resulting in a failure to initialize the metastore location. However there is now logging for those failures and I am seeing no such event logged when the failure occurs. There are two remaining possibilities that I can see: 1) server.core bundle has not yet started 2) server.core bundle has been stopped From later logged events I can see server.core is active so I think 1) is the most likely situation. The bundle is lazy started so in theory it should not be possible to access Activator.getMetaStore() when the bundle has not been started. However the call chain is within another bundle activator so maybe we are seeing a timing window where server.activator is activated prior to server.core. We had a similar scenario in bug 407869 with a race condition between search and metastore initialization. My theory is that we rarely see this because the race condition only occurs if an inbound client request causes servlets to be started before server.core initialization is complete. Both times we have seen this, the server was under heavy request load during startup (restarting a production server while it is taking user traffic). Funny (not) that this issue came up. When the orionhub.org VM server crashed last week, when I restarted orionhub.org it blew up as well. I had to restart a second time.
The log is at /home/admin/orionhub.org/current/eclipse-20150612-1125/log.txt
There errors start with:
!SESSION 2015-06-12 11:20:14.117 -----------------------------------------------
eclipse.buildId=9.0.0-v20150512-1027
java.version=1.7.0_71
java.vendor=Oracle Corporation
BootLoader constants: OS=linux, ARCH=x86_64, WS=gtk, NL=en_CA
Framework arguments: -vm /home/admin/orionhub.org/current/current/jdk1.7.0_25/jre/bin/java
Command-line arguments: -os linux -ws gtk -arch x86_64 -consoleLog -data /localdata/orionhub.org/serverworkspace -vm /home/admin/orionhub.org/current/current/jdk1.7.0_25/jre/bin/java
!ENTRY org.eclipse.equinox.ds 2 0 2015-06-12 11:20:45.503
!MESSAGE [SCR] Enabling components of bundle org.eclipse.orion.server.events did not complete in 30000 ms
!ENTRY org.eclipse.equinox.ds 2 0 2015-06-12 11:20:45.529
!MESSAGE [SCR - WorkThread] Timeout occurred! Thread was blocked on processing [QueuedJob] WorkPerformer: org.eclipse.equinox.internal.ds.SCRManager@5218f9d7; actionType 1
!ENTRY org.eclipse.osgi 2 0 2015-06-12 11:20:50.580
!MESSAGE While loading class "org.eclipse.orion.server.core.OrionConfiguration", thread "Thread[Start Level: Equinox Container: 30d68786-1611-0015-1662-c962072a9f80,5,main]" timed out waiting (5001ms) for thread "Thread[Component Resolve Thread (Bundle 36),10,main]" to finish starting bundle "org.eclipse.orion.server.core_1.0.0.v20150504-1950 [69]". To avoid deadlock, thread "Thread[Start Level: Equinox Container: 30d68786-1611-0015-1662-c962072a9f80,5,main]" is proceeding but "org.eclipse.orion.server.core.OrionConfiguration" may not be fully initialized.
!STACK 0
org.osgi.framework.BundleException: Unable to acquire the state change lock for the module: osgi.identity; osgi.identity="org.eclipse.orion.server.core"; type="osgi.bundle"; version:Version="1.0.0.v20150504-1950"; singleton:="true" [id=69] STARTED [STARTED]
at org.eclipse.osgi.container.Module.lockStateChange(Module.java:329)
at org.eclipse.osgi.container.Module.start(Module.java:389)
The error is caused by OrionConfiguration.getMetaStore() somehow returning null during startup. We are unable to reproduce this issue in development. We do not want to cause Activator.start() to fail when this issue occurs. The solution is to move the work performed in the start() method into a Job. If the null condition occurs, we can wait five seconds and try again. The code change is in commit: http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=d45ed98af302c5fdbeda482679918dc4bfff0fe9 Can now see the log message we want: [Worker-0] INFO org.eclipse.orion.server.config - Started Orion server authentication successfully. |