Bug 318277

Summary: Eclipse Help System WAR has a deadlock at startup
Product: [Eclipse Project] Equinox
Reporter: Leo <xiexing>
Component: Framework
Assignee: Thomas Watson <tjwatson>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: cgold, simon_kaegi, tjwatson
Version: 3.4.2
Flags: simon_kaegi: review+
Target Milestone: 3.7 M1
Hardware: Other
OS: other
Whiteboard:
Bug Depends on:
Bug Blocks: 320760
Attachments:
Description                                   Flags
logs                                          none
Patch to always allow application/xhtml+xml   none
work in progress                              none
work in progress 2                            none
patch                                         none
patch                                         none
cleaned up patch                              none
3.5 or earlier patch                          none

Description Leo CLA 2010-06-29 04:30:45 EDT
Build Identifier: 3.4

Our customers encountered a deadlock with our IEHS 3.4.2 WAR build, which is based on Eclipse Help System 3.4.2. Our development team then extracted an EHS WAR package based on Eclipse 3.4.2, but customers still hit the deadlock with that EHS WAR. Following is some information from the customers:
======================================
The problem can be recreated by simply starting and stopping the server. I do not have an exact figure for the rate of failure, but it seems to happen once every 30-50 server starts.
I am running on an 8-way AIX machine, and have increased the server startup threads from 3 to 10, which seems to increase the rate of failure. I have also seen the problem on a Solaris machine (running 3 startup threads).
The entire server process is dead locked by this failure.
======================================

The customers have done some investigation. They believe that this is a problem with Equinox and that a fix already exists in the code, but that the fix has to be enabled via a configuration setting in the help system. Following is the information they provided:
======================================
Here is my analysis of the core dump and of what I think is happening in the server deadlock during server start.

I am running on an 8-processor AIX machine and on a Sun machine. Both sets of hardware have the problem. I have not run on Windows.
The basic scenario is very simple: the server is simply started. In some instances we see a deadlock in the server runtime:

1LKDEADLOCK    Deadlock detected !!!
NULL           ---------------------
NULL
2LKDEADLOCKTHR  Thread "server.startup : 7" (0x35148600)
3LKDEADLOCKWTR    is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x3A241490 infl_mon_t: 0x3A2414B0:
4LKDEADLOCKOBJ      org/eclipse/osgi/framework/internal/protocol/StreamHandlerFactory@0xB0101770/0xB010177C:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "server.startup : 5" (0x37F4E100)
3LKDEADLOCKWTR    which is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x3A241438 infl_mon_t: 0x3A241458:
4LKDEADLOCKOBJ      org/eclipse/equinox/servletbridge/FrameworkLauncher$ChildFirstURLClassLoader@0xB877F4D8/0xB877F4E4:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "server.startup : 7" (0x35148600)
NULL


The complex scenario is :
1) The WAS server runtime starts the launcher process for the server JVM
2) The OSGI environment is initialized by the server runtime
	- as part of the initialization, the singleton : ChildFirstURLClassLoader is created
3) A number of WAS components are started (not too interesting)
4) The WAS Application manager is invoked as part of the server start process to start the applications installed on the server
	- The BusinessSpacerHelp.war application is started
		- contained within the WAR is another osgi/equinox environment
		- the osgi environment in the WAR is started
			- A new singleton : ChildFirstURLClassLoader is created
5) The dead lock occurs if another application is using the ChildFirstURLClassLoader


The problem is similar to a problem reported to WAS and fixed in PK81985 

However, even when the system value was set in the custom properties for the JVM, the deadlock still occurred.
NOTE: The WAS solution consisted of two changes : 1) a fix to equinox, 2) a change to the launcher code.
(I am guessing but I think that the fix for equinox was to check to see if there had been a class already instantiated).

I believe that the reason why I continue to see the problem after running with the WAS setup is one of the following.

THIS IS WHERE I NEED SOME HELP:

1) The equinox version which is running in the bspace help war does not include the fix associated with PK81985 (I do not have the associated bug report for the equinox fix).
2) The environment settings that are part of the WAS server start environment are not getting propagated to the bspace war environment.
	- This is where I am a little fuzzy about what needs to be done, but I suspect that there is an .ini file somewhere that could be updated to allow the correct values to be set.
	   I have looked at the WAS code, and I think that setting these values when launching the osgi environment in the WAR would work around the problem:

	  osgi.parentClassloader=fwk
	  osgi.frameworkParentClassloader=app
======================================
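
For readers unfamiliar with this kind of javacore: the dump above shows a classic lock-order inversion, where "server.startup : 5" owns the StreamHandlerFactory monitor and waits for the ChildFirstURLClassLoader monitor, while "server.startup : 7" owns them the other way around. The following is a minimal sketch of that pattern only, not Equinox code. The two plain objects are hypothetical stand-ins for the real monitors, and the class and method names are invented for illustration. It uses the standard ThreadMXBean API to confirm the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderInversionDemo {

    // Starts a daemon thread that takes `first`, waits until its peer also
    // holds a lock, and then tries to take `second`.
    static Thread startLocker(String name, final Object first, final Object second,
                              final CountDownLatch bothHoldFirst) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                synchronized (first) {
                    bothHoldFirst.countDown();
                    try {
                        bothHoldFirst.await();
                    } catch (InterruptedException e) {
                        return;
                    }
                    synchronized (second) {
                        // Not reached: the peer thread owns `second` and never releases it.
                    }
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit although these threads never finish
        t.start();
        return t;
    }

    // Returns true if the JVM reports both threads as deadlocked on monitors.
    static boolean reproduceAndDetect() {
        Object handlerFactory = new Object(); // stand-in for StreamHandlerFactory
        Object classLoader = new Object();    // stand-in for ChildFirstURLClassLoader
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Thread t5 = startLocker("server.startup : 5", handlerFactory, classLoader, bothHoldFirst);
        Thread t7 = startLocker("server.startup : 7", classLoader, handlerFactory, bothHoldFirst);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null && contains(ids, t5.getId()) && contains(ids, t7.getId())) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    private static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceAndDetect());
    }
}
```

The latch only makes the interleaving deterministic for the demonstration. In the reported scenario the same interleaving happens by chance, which fits the observation that more startup threads raise the failure rate.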

The logs can be found here : http://rchgsa.ibm.com/~malin/public/javacore.logs.zip

Reproducible: Sometimes
Comment 1 Leo CLA 2010-06-29 04:32:42 EDT
Created attachment 172988 [details]
logs
Comment 2 Chris Goldthorpe CLA 2010-06-29 16:38:19 EDT
Created attachment 173046 [details]
Patch to always allow application/xhtml+xml

This patch will allow the content type "application/xhtml+xml" to be returned for all browsers. It is not recommended for IE users who do not have extensions to handle xml.
Comment 3 Chris Goldthorpe CLA 2010-06-30 17:01:26 EDT
Please ignore the previous comment - it was intended for a different bug.
Comment 4 Chris Goldthorpe CLA 2010-06-30 17:08:30 EDT
Simon - does a deadlock in the ServletBridge when loading classes sound at all familiar to you? This was in Eclipse 3.4.
Comment 5 Simon Kaegi CLA 2010-06-30 23:43:11 EDT
(In reply to comment #4)
> Simon - does a deadlock in the ServletBridge when loading classes sound at all
> familiar to you. This was in Eclipse 3.4.

Perhaps... there were many fixes done so we need to narrow this down. It would be good to know what the fix in PK81985 was.
Comment 6 Chris Goldthorpe CLA 2010-07-01 13:23:22 EDT
I could not find out from the problem report exactly what was changed. Here is the description from that report. 

Deadlocking caused by OSGI protocol handler handling.
Found one Java-level deadlock:
=============================
"WebContainer : 0":
waiting to lock monitor 0x02adfaf8 (object 0x9886a3e0, a
org.eclipse.core.launcher.Main$StartupClassLoader),
which is held by "CheckPropFiles"
"CheckPropFiles":
waiting to lock monitor 0x02adfc60 (object 0x98897190, a
sun.misc.Launcher$AppClassLoader),
which is held by "WebContainer : 0"
With the deadlock you will find similar information in thread
dump as above.

      PROBLEM DESCRIPTION: A deadlock can occur when embedding
                           OSGI Equinox in an application.
      RECOMMENDATION:
      During the initialization of the OSGI Equinox runtime, Equinox
      clears out the cache of java.net.URLStreamHandlers
      and installs its own java.net.URLStreamHandlerFactory. From
      then on URL Handler requests are routed through this
      URLStreamHandler and subsequently cached.  This occurs early
      on during server startup and is properly setup by the time
      applications are initialized and started.  When an application
      embeds Equinox in their application, this second instance
      performs the same initialization of clearing the cache and
      installing its own URLStreamHandler during this application's
      initialization and start.  This poses a problem when different
      threads (other applications) are loading classes and resources
      as both operations use URLStreamHandlers.  A deadlock can
      occur when the second OSGI Equinox instance initializes while
      two other threads are loading a class and a resource.

Problem conclusion

      By enabling the ws.osgi.parentclassloader.fwk custom property,
      Equinox's classloader delegation order will be modified to avoid
      the deadlock caused by applications with embedded Equinox.
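
As background to the description above: the JVM holds a single, process-wide URLStreamHandlerFactory, and the public java.net.URL.setURLStreamHandlerFactory method may be called at most once per JVM. That is why a second, embedded framework instance cannot simply install its own factory through the public API. The sketch below only demonstrates that JVM restriction. The class name and the no-op factory are hypothetical, and it says nothing about how Equinox itself installs or shares its factory.

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

class SingletonHandlerFactoryDemo {

    // Hypothetical factory that handles no protocol itself.
    static class NoOpFactory implements URLStreamHandlerFactory {
        public URLStreamHandler createURLStreamHandler(String protocol) {
            return null; // null makes java.net.URL fall back to its built-in handlers
        }
    }

    // Returns true if the JVM rejects a second factory installation.
    static boolean secondInstallRejected() {
        try {
            URL.setURLStreamHandlerFactory(new NoOpFactory());
        } catch (Error alreadyInstalled) {
            // Some other code in this JVM already installed a factory.
        }
        try {
            URL.setURLStreamHandlerFactory(new NoOpFactory());
            return false;
        } catch (Error expected) {
            return true; // "factory already defined"
        }
    }

    public static void main(String[] args) {
        System.out.println("second install rejected: " + secondInstallRejected());
    }
}
```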
Comment 7 Simon Kaegi CLA 2010-07-01 21:04:49 EDT
perhaps bug 303842 -- should at least help.

The suggested workaround of using "osgi.parentClassloader=fwk" is not really a great idea but then I don't know the details.
Comment 8 Chris Goldthorpe CLA 2010-07-02 17:33:20 EDT
(In reply to comment #7)
> perhaps bug 303842 -- should at least help.
> 

That looks to be a different problem -  bug 303842 is about recursing on a single thread, in this case two threads are both trying to use the ChildFirstURLClassLoader.
Comment 9 Thomas Watson CLA 2010-07-07 15:29:45 EDT
(In reply to comment #6)
> I could not find out from the problem report exactly what was changed. Here is
> the description from that report. 
> 

There was no eclipse/equinox bug opened for this issue (referred to as PK81985 in comment 0).  If I recall correctly PK81985 had an issue because the configuration for the server had the bundle class loaders using a different parent class loader (app) than the framework.  By default equinox will use the boot class loader as the parent of the framework class loader and all bundle class loaders.

When the framework class loader's parent is different than the bundle class loader's we were seeing a case where out of order locks were being used by two threads on the framework class loader and a bundle's class loader.


> Deadlocking caused by OSGI protocol handler handling.
> Found one Java-level deadlock:
> =============================
> "WebContainer : 0":
> waiting to lock monitor 0x02adfaf8 (object 0x9886a3e0, a
> org.eclipse.core.launcher.Main$StartupClassLoader),
> which is held by "CheckPropFiles"
> "CheckPropFiles":
> waiting to lock monitor 0x02adfc60 (object 0x98897190, a
> sun.misc.Launcher$AppClassLoader),
> which is held by "WebContainer : 0"
> With the deadlock you will find similar information in thread
> dump as above.
> 

Notice that this dump shows a deadlock caused by contention between the locks on two different class loaders.  I think this is different than the deadlock reported in this bug report.  This bug indicates out of order locks for the StreamHandlerFactory and the FrameworkLauncher$ChildFirstURLClassLoader.  Setting the following will not work around this issue.

      osgi.parentClassloader=fwk
      osgi.frameworkParentClassloader=app

This is because I think there is still a chance that the MultiplexingFactory will obtain a lock on the factory before delegating to a method with reflection.  Right now the only way around this that I can see is to change MultiplexingFactory to not hold its lock while invoking a method using reflection.  This would require a change to the version of equinox that WebSphere is using in its core as well as the version of equinox used in the help system.  I need to discuss this issue with Simon to see what is the best path forward.
Comment 10 Thomas Watson CLA 2010-07-07 16:02:39 EDT
Leo, if you have access to the complete core dump associated with the issue PK81985 you mention in comment 0 that would help me determine if it is a different issue or not.  Please attach it if you can find it.  Thanks.
Comment 11 Chris Goldthorpe CLA 2010-07-07 19:33:56 EDT
Reassigning to Equinox Framework.
Comment 12 Leo CLA 2010-07-08 21:56:23 EDT
(In reply to comment #10)
> Leo, if you have access to the complete core dump associated with the issue
> PK81985 you mention in comment 0 that would help me determine if it is a
> different issue or not.  Please attach it if you can find it.  Thank.

Thomas, Frank has provided you the core file (http://rchgsa.ibm.com/~malin/public/core.zip). If you do not have a GSA id, please let him know and he will send you the files.
Comment 13 Thomas Watson CLA 2010-07-09 09:40:53 EDT
(In reply to comment #12)
> (In reply to comment #10)
> > Leo, if you have access to the complete core dump associated with the issue
> > PK81985 you mention in comment 0 that would help me determine if it is a
> > different issue or not.  Please attach it if you can find it.  Thank.
> 
> Thomas, Frank has provided you the core file
> (http://rchgsa.ibm.com/~malin/public/core.zip). If you do not have a GSA id,
> please let him know and he will send you the files.

Thanks, but this is the same dump for the scenario outlined in this bug.  What I really need is the full core for the issue outlined in comment 6.  There was no bug opened for that issue and so we have no history of the dump or what the actual issue was.  I kind of remember, but it was a while back.  The work around for that issue was to set:

      osgi.parentClassloader=fwk
      osgi.frameworkParentClassloader=app

But the various cores attached to this bug report indicate a completely different deadlock, which I am convinced will not be solved by setting the above configuration properties.
Comment 14 Thomas Watson CLA 2010-07-09 15:00:54 EDT
Created attachment 173899 [details]
work in progress

This is a rough work in progress.  The idea is that we can have multiple readers or a single writer accessing the factories field in the MultiplexingFactory.  I wrote a quick and dirty ReadersWriteLock that ensures that either multiple readers or a single writer can access the factories object at one time.

I did not use java.util.concurrent for this since we are still attempting to avoid Java 5 in the framework right now.  This should allow the cases in this bug that show the deadlock to proceed, because the threads involved in those cases are only readers of the factories.
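
For illustration, a readers-writer lock of the kind described here can be built from intrinsic monitors alone, without java.util.concurrent. The sketch below is not the attached patch. The class name, the method names, and the lack of any fairness policy are assumptions made for the example, and a steady stream of readers can starve a writer in this simple form.

```java
class SimpleReadersWriterLock {
    private int activeReaders = 0;
    private boolean writerActive = false;

    // Readers may enter together as long as no writer is active.
    synchronized void lockRead() throws InterruptedException {
        while (writerActive) {
            wait();
        }
        activeReaders++;
    }

    synchronized void unlockRead() {
        if (activeReaders == 0) {
            throw new IllegalStateException("no active reader");
        }
        activeReaders--;
        if (activeReaders == 0) {
            notifyAll(); // a waiting writer may now proceed
        }
    }

    // A writer needs exclusive access: no readers and no other writer.
    synchronized void lockWrite() throws InterruptedException {
        while (writerActive || activeReaders > 0) {
            wait();
        }
        writerActive = true;
    }

    synchronized void unlockWrite() {
        if (!writerActive) {
            throw new IllegalStateException("no active writer");
        }
        writerActive = false;
        notifyAll(); // wake both waiting readers and waiting writers
    }

    synchronized int activeReaders() {
        return activeReaders;
    }
}
```

Callers would normally pair each lock call with the matching unlock in a finally block so that an exception cannot leave the lock held.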
Comment 15 Thomas Watson CLA 2010-07-09 18:24:49 EDT
Created attachment 173930 [details]
work in progress 2

I found some bugs in the ReadersWriterLock.  Here are some tests + the fixes.
Comment 16 Thomas Watson CLA 2010-07-21 16:17:07 EDT
Created attachment 174911 [details]
patch

Ongoing tests are being done on the previous patch in the environment that reproduced the hang.  The results are looking good so far.  This is a slightly modified patch that applies to head cleanly.  The only code change is that I synchronized the methods of the ReadersWriteLock instead of locking on an internal lock object.
Comment 17 Thomas Watson CLA 2010-07-22 11:15:27 EDT
Created attachment 174991 [details]
patch

It turns out the ReadWrite lock approach is not sufficient and is overly complicated anyway for the problem.  This patch takes a much simpler approach using a copy-on-write pattern.  We also removed the synchronization on the factory instance lock while reflectively calling out to other registered factories.

This was possible because of the additional locking already present in the Framework class to protect the operations which attempt to manipulate the singleton factory fields of the VM.
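
A rough sketch of the copy-on-write idea, for readers who want to see its shape: writers replace the whole list under a short lock, and readers take a snapshot and then call out to the registered factories without holding any lock. This is not the attached patch. The class name, the Factory interface, and the String return type are hypothetical simplifications, and plain interface calls stand in for the reflective calls mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CopyOnWriteFactoryRegistry {

    /** Hypothetical stand-in for a registered factory. */
    interface Factory {
        String create(String protocol);
    }

    // Never mutated after publication; replaced wholesale on each change.
    private List<Factory> factories = Collections.emptyList();

    // Short critical section: only reads the current reference.
    private synchronized List<Factory> getFactories() {
        return factories;
    }

    synchronized void register(Factory factory) {
        List<Factory> copy = new ArrayList<Factory>(factories);
        copy.add(factory);
        factories = Collections.unmodifiableList(copy);
    }

    synchronized void unregister(Factory factory) {
        List<Factory> copy = new ArrayList<Factory>(factories);
        copy.remove(factory);
        factories = Collections.unmodifiableList(copy);
    }

    // Calls out to each factory WITHOUT holding the registry lock, so a
    // factory that takes other locks cannot form a cycle with this monitor.
    String create(String protocol) {
        for (Factory factory : getFactories()) {
            String result = factory.create(protocol);
            if (result != null) {
                return result;
            }
        }
        return null;
    }
}
```

Because iteration runs over an immutable snapshot, a factory that registers or unregisters another factory during the call-out neither deadlocks nor triggers a ConcurrentModificationException. The change only becomes visible to the next lookup.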
Comment 18 Thomas Watson CLA 2010-07-22 11:15:55 EDT
Simon please review.
Comment 19 Simon Kaegi CLA 2010-07-22 18:28:09 EDT
Created attachment 175024 [details]
cleaned up patch

The approach looks good.

This patch is a cleanup patch that adds a synch on the getFactories call and also moves the add, remove, and release logic for factories into private synched methods.
Comment 20 Thomas Watson CLA 2010-07-23 14:07:42 EDT
Thanks for the clean up Simon.  Patch released to HEAD.
Comment 21 Thomas Watson CLA 2010-07-23 14:14:47 EDT
Created attachment 175091 [details]
3.5 or earlier patch

The last patch from Simon applies to HEAD and to 3.6.  This patch applies to 3.5 or earlier.
Comment 22 Thomas Watson CLA 2010-07-23 14:18:33 EDT
In order to fix the hang reported in this bug, the outer Equinox used to launch all of WAS must be patched to include this fix.  You are also advised to update the Equinox used to launch the Equinox Help System to include this fix.