Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 344727

Summary: [Workbench] Deadlock in of workbench use of synchronization
Product: [Eclipse Project] Platform Reporter: Jeff McAffer <jeffmcaffer>
Component: UIAssignee: Oleg Besedin <ob1.eclipse>
Status: VERIFIED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: bokowski, daniel_megert, emoffatt, irbull, jens.borrmann, Mike_Wilson, niklas.deutschmann, ob1.eclipse, prakash, pwebster, remy.suen, tjwatson
Version: 3.6.1   
Target Milestone: 3.8 M2   
Hardware: PC   
OS: Mac OS X - Carbon (unsup.)   
Whiteboard:
Bug Depends on:    
Bug Blocks: 347740    

Description Jeff McAffer CLA 2011-05-04 12:31:02 EDT
in WorkbenchPlugin.bundleChanged there is a synchronize block on a hashset.  This block is called whenever a change to the state of a bundle is detected.  The listener listening for these change events is a SynchronousBundleListener (see getBundleListener).  The net effect is that while a bundle is being activated (i.e., its state is changing), the workbench's listener is notified and synchronizes on the hashset.

It seems that through a quirk of classloading or something, hitting the event.getType() line in bundleChanged() can trigger some classloading.  I'm not sure exactly what but the thread that has the lock on the hashset blocks trying to get a lock on the classloader.  If some other thread has the classloader lock and is trying to change the state of a bundle (thus resulting in a bundleChanged() call) then we get deadlock.

Indeed that is what is happening in a case where the PDE Build application is being run using an Eclipse IDE install.  See bug 344030.

While this may highlight a particular problem with the JVM or Equinox, it would help if we could do one or more of the following:

- change the bundle listener to be Asynchronous (not sure I fully understand what the listener is doing or how this would affect that)

- move the event.getType() call outside the synchronized block.  We are testing this change to see if it helps.
Comment 1 Jeff McAffer CLA 2011-05-25 17:05:08 EDT
Moving the event.getType call to outside the sync block appears to have addressed the problem. The change seems very safe.  Can we look at getting this in for Indigo?
Comment 2 Oleg Besedin CLA 2011-05-26 11:23:36 EDT
Today is the testing/verification of RC3 so this would have to go into RC4. That needs PMC approval so I'll add McQ and Boris to the list.
Comment 3 Boris Bokowski CLA 2011-05-26 19:37:02 EDT
(In reply to comment #0)
> - change the bundle listener to be Asynchronous (not sure I fully understand
> what the listener is doing or how this would affect that)

I don't understand either but see bug 171833 comment 0: "As per bug 171584 the platform needs to use a synchronized bundle listener to be able to see the LAZILY_AWAITING_ACTIVATION state."

> - move the event.getType() call outside the synchronized block.  We are testing
> this change to see if it helps.

Sounds like a safe fix to me.
Comment 4 Mike Wilson CLA 2011-05-28 16:00:31 EDT
(In reply to comment #3)
> > - move the event.getType() call outside the synchronized block.  We are testing
> > this change to see if it helps.
> 
> Sounds like a safe fix to me.

True if we're certain that getType is side-effect free. Of course, the fact that it's causing class loading at least hints that that might not be the case...
Comment 5 Jeff McAffer CLA 2011-05-29 17:25:57 EDT
(In reply to comment #4)
> True if we're certain that getType is side-effect free. Of course, the fact
> that it's causing class loading at least hints that that might not be the
> case...

There are functional side effects and underlying VM side-effects.  There are no functional side-effects here.  I could see it changing the locking behaviour (underlying VM behaviour) so could in theory changing the timing of something else.  I don't know how we could test or validate that.  As it is without the change, it seemed rather easy for us to get a deadlock.  Just do a PDE build running a full Eclipse install's ant runner application. We may have been using the JEE package so perhaps the different set of bundles installed is causing a different sequence of bundle events...

If the change is deemed too much at this point we can retag for SR1.
Comment 6 Oleg Besedin CLA 2011-05-30 11:37:18 EDT
How about we do it early in 3.8 and back port the change to 3.7.1?

I am a bit hesitant to make a change at this point of 3.7 as I can't reproduce a deadlock (could it be specific to Mac?)
Comment 7 Jeff McAffer CLA 2011-05-30 12:49:11 EDT
The deadlock actually happens on some linux build machines (so not specific to Mac) and as it is a race condition, you have to run many builds for some of them to lock up.

I'm fine doing it in 3.8 and backporting.
Comment 8 Ian Bull CLA 2011-07-20 22:06:29 EDT
(In reply to comment #6)
> How about we do it early in 3.8 and back port the change to 3.7.1?
> 
> I am a bit hesitant to make a change at this point of 3.7 as I can't reproduce
> a deadlock (could it be specific to Mac?)

I was able to reproduce the deadlock on Linux (both 32 and 64 bit).
Comment 9 Oleg Besedin CLA 2011-08-31 16:46:26 EDT
Added the suggested change in the 3.8 and 4.2 streams:

http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?h=R3_development&id=1e4ca2069cf2dcc2cdb4112a6d1eb45783e4b810
Comment 10 Remy Suen CLA 2011-08-31 17:02:33 EDT
(In reply to comment #9)
> Added the suggested change in the 3.8 and 4.2 streams:
> 
> http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?h=R3_development&id=1e4ca2069cf2dcc2cdb4112a6d1eb45783e4b810

I don't see this change in R4_development. Are you trying out Andrew's script?
Comment 11 Oleg Besedin CLA 2011-09-01 09:40:01 EDT
(In reply to comment #10)
> I don't see this change in R4_development. Are you trying out Andrew's script?

Nope, that would be cool :-). I forgot to push the repository after making R4 change. Done this morning:

http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?h=R4_development&id=12a874c06118ea95a0fb2beb1312739f27f9775b
Comment 12 Oleg Besedin CLA 2011-09-13 10:38:56 EDT
Verified that updated code is present in the I20110913-0200 build.
Comment 13 Jens Borrmann CLA 2011-09-30 01:43:59 EDT
*** Bug 323238 has been marked as a duplicate of this bug. ***