Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 311863 - OrderedLock: interrupting during acquire can lead to deadlock
Summary: OrderedLock: interrupting during acquire can lead to deadlock
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: Runtime (show other bugs)
Version: 3.6   Edit
Hardware: PC Linux-GTK
: P3 critical (vote)
Target Milestone: 3.6 RC1   Edit
Assignee: John Arthorne CLA
QA Contact:
URL:
Whiteboard:
Keywords:
: 308235 312543 313086 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-05-06 08:31 EDT by James Blackburn CLA
Modified: 2010-05-17 11:45 EDT (History)
5 users (show)

See Also:
tjwatson: review+


Attachments
back trace (14.48 KB, text/plain)
2010-05-06 08:31 EDT, James Blackburn CLA
no flags Details
back-trace 2 (35.45 KB, text/plain)
2010-05-06 08:45 EDT, James Blackburn CLA
no flags Details
Log with OrderedLock#DEBUG = true (45.77 KB, text/plain)
2010-05-06 10:06 EDT, James Blackburn CLA
no flags Details
test (4.90 KB, patch)
2010-05-06 18:29 EDT, James Blackburn CLA
no flags Details | Diff
test 2 (5.23 KB, patch)
2010-05-06 19:34 EDT, James Blackburn CLA
no flags Details | Diff
patch 1 (2.05 KB, patch)
2010-05-06 20:15 EDT, James Blackburn CLA
no flags Details | Diff
patch 2 (2.24 KB, patch)
2010-05-07 18:15 EDT, James Blackburn CLA
no flags Details | Diff
patch 3 (2.23 KB, patch)
2010-05-08 07:54 EDT, James Blackburn CLA
john.arthorne: iplog+
Details | Diff
patch 4 (3.68 KB, patch)
2010-05-10 14:29 EDT, John Arthorne CLA
no flags Details | Diff
test 3 (5.42 KB, patch)
2010-05-10 15:34 EDT, James Blackburn CLA
no flags Details | Diff
test 4 (4.80 KB, patch)
2010-05-10 15:46 EDT, James Blackburn CLA
john.arthorne: iplog+
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description James Blackburn CLA 2010-05-06 08:31:45 EDT
Created attachment 167289 [details]
back trace

+++ This bug was initially created as a clone of Bug #294894 +++

This has happened again :(

Just encountered a lock-up during debug session tear-down. AFAICS the hung threads (including main) are waiting for the workspace lock, and no one holds it...

Looking at the fields in the OrderedLock:

this	OrderedLock  (id=3361)	
	currentOperationThread	null	
	depth	0	
	manager	LockManager  (id=3520)	
	number	0	
	operations	Queue  (id=3522)	

DeadlockDetector graph looks like:
graph	 (id=3540)	[[-1, 0], [-1, 1], [-1, 0], [-1, 0], [-1, 0]]
locks:
[0]	OrderedLock  (id=3361)	
[1]	WorkspaceRoot  (id=3500)	

I didn't have lock tracing on, so I'm not sure whether the deadlock detector came into play before this lock was lost.
Comment 1 James Blackburn CLA 2010-05-06 08:37:28 EDT
I should add my OrderedLock has my fix for bug 307282 (which I believe to be valid) - otherwise there doesn't seem to be any relevant changes compared with HEAD.  Have moved o.e.core.jobs to HEAD in case it happens again.
Comment 2 James Blackburn CLA 2010-05-06 08:45:53 EDT
Created attachment 167291 [details]
back-trace 2

Having move to HEAD I've reproduced it. Same symptoms as before. workspace lock isn't held by anyone and 4 threads waiting.
Comment 3 James Blackburn CLA 2010-05-06 08:55:51 EDT
It's not (obviously) Deadlock related. Turning on org.eclipse.core.jobs/jobs/errorondeadlock=true (which I've verified in the debugger), the deadlock IllegalStateException is never hit.
Comment 4 James Blackburn CLA 2010-05-06 10:06:58 EDT
Created attachment 167307 [details]
Log with OrderedLock#DEBUG = true
Comment 5 James Blackburn CLA 2010-05-06 18:29:13 EDT
Created attachment 167412 [details]
test

Got it! Another bug in OrderedLock... 

I have some normal, non-deadlocking code (CDT), that's doing this with the UI very very reliably - one or two runs of this failing debug session causes the issue.

Added some more tracing:
[Thread[main,6,main]] Operation waiting to be executed...  OrderedLock (0)
[Thread[main,6,main]] doAcquire() enter OrderedLock (0)
[Thread[Worker-0,5,main]] doRelease() @ depth 0 OrderedLock (0)
[Thread[Worker-0,5,main]] setCurrentOperationThread() null OrderedLock (0)
[Thread[Worker-0,5,main]] release() @ depth 0 OrderedLock (0)
[Thread[Worker-18,5,main]] doAcquire() success=true OrderedLock (0)
[Thread[Worker-18,5,main]] setCurrentOperationThread() Thread[Worker-18,5,main] OrderedLock (0)
[Thread[Worker-18,5,main]] Operation started...  OrderedLock (0)
[Thread[Worker-18,5,main]] doRelease() @ depth 0 OrderedLock (0)
[Thread[Worker-18,5,main]] setCurrentOperationThread() null OrderedLock (0)
[Thread[Worker-18,5,main]] release() @ depth 0 OrderedLock (0)
[Thread[Worker-0,5,main]] Operation waiting to be executed...  OrderedLock (0)
[Thread[Worker-0,5,main]] doAcquire() enter OrderedLock (0)
[Thread[main,6,main]] Operation interrupted while waiting... :-| OrderedLock (0)

What you can see in the last two lines is this:
   The main thread has been interrupted *just* as the Semaphore is released(). 
There are 2 races here:
  1) The semaphore can be acquired (notification++) just before the thread is interrupted. i.e. you can get an ordering of: sem.notification++; interrupt(); notify() -> not a successful acquire().
  2) Semaphore interrupted(); sem.notification++

In both cases the semaphore is discarded as a result of the interruption, so the notification is lost.

Now this is nearly always ok. The next thread trying to acquire the lock succeeds -- and this might be the same thread called from #acquire().

However if there's more than one thread in contention, then the new thread will just join the queue. The lock is then lost forever.

Test attached. It's unfortunately a data race internal to OrderedLock, so it's a pretty long test...

The safe solution could be one of two:
 - check whether the Sempaphore can be acquire(0) before discarding it
 - don't throw away the semaphore

Some sanity checks could be added to OrderedLock when a thread is queued that currentThread == null || operations.empty() -- though this may be difficult as doAcquire() isn't synchronized.

John I think this was the original source of the bug 294894.  Lots of interrupting happens when the UI thread tries to acquire locks, and, afaics, it always goes through the acquire() code path which would suggest it's not bug 307282.
Comment 6 James Blackburn CLA 2010-05-06 19:34:52 EDT
Created attachment 167415 [details]
test 2

Much more aggressive test. TestBarrier takes too long to provoke the race we're trying to catch.

This test has the 3 threads fighting for the lock, with the main thread interrupting them.  It fails for me reliably in the first iteration.

Comment out the intterupt()s to make the test pass...
Comment 7 James Blackburn CLA 2010-05-06 20:15:16 EDT
Created attachment 167421 [details]
patch 1

Potential patch for the issue.

doAcquire() isn't synchronized (it calls external code in the DeadlockDetector, etc.). However we should, as part of a synchronized block, test whether the semaphore has been released before discarding it.

This patch does a check of the semaphore followed by a removal from the operations queue if unsuccessful, in a single sychronized block.  This needs to be in one sync block to prevent the 2 issues discussed previously.  As the #doRelease() uses the same lock, this prevents concurrent notification and stops us from missing a lock release.
Comment 8 James Blackburn CLA 2010-05-06 20:16:42 EDT
Bumping to critical. This issue makes it very easy for the UI to break runtime locks.
Comment 9 John Arthorne CLA 2010-05-07 17:38:14 EDT
I have only had a chance to look at this briefy, but the fix isn't right. We must call manager.aboutToWait when in a loop waiting for a lock. This allows us to resolve deadlocks between an OrderedLock and the UI thread (lock owner does syncExec while UI is waiting for that lock). There must be a simpler fix here but I need to look at it more.
Comment 10 James Blackburn CLA 2010-05-07 17:46:37 EDT
The patch doesn't change that the flow.  

All it does is acquire(_0_) before the semaphore dequeue to check that it hasn't been notified concurrently immediately before.  The reason for the while loop is that acquire(...) can throw interruptexception. The intention is not tospend any time doing this, just catch the race.  

I agree it's not the prettiest fix. But afaics the only other alternative is to modify semaphore.
Comment 11 James Blackburn CLA 2010-05-07 18:15:59 EDT
Created attachment 167587 [details]
patch 2

This is (I believe) an equivalent patch, without the ugly while loop.

Semaphore.attempt() method added to attempt to acquire the semaphore without waiting.  The key fix in this (and patch 1) is that the semaphore is checked and dequeued while synchronized on the OrderedLock monitor.  This is the same monitor used in doRelease().

I think the fix for bug 307282 made this issue worse for callers of acquire(). Previously the semaphore wasn't discarded, now it always is, so the bug is as easily visible here as it is with acquire(int).
Comment 12 James Blackburn CLA 2010-05-08 07:54:48 EDT
Created attachment 167598 [details]
patch 3

Even tidier patch, it looks like OrderedLock monitor is acquired before the DeadlockDetector.  So it's ok to put the whole doAcquire() exit in the synchronized.
Comment 13 John Arthorne CLA 2010-05-10 14:29:22 EDT
Created attachment 167776 [details]
patch 4

Minor tidying on patch 3. The only changes are copyright updates, and refactoring the new synchronized block into a separate method.

I tried various other approaches but it amounts to the same thing. We need to atomically consider the semaphore notification state before discarding it. Since we can't wait while holding the object local on OrderedLock, the only choice is to check the semaphore state again from within a synchronized block.
Comment 14 John Arthorne CLA 2010-05-10 14:39:09 EDT
Tom, another bug for review in jobs bundle.

In OrderedLock#doAcquire we had this code (slightly simplified):

1.	success = semaphore.acquire(delay);
2.	if (success) {
3.		depth++;
4.		updateCurrentOperation();
5.	} else {
6.		removeFromQueue(semaphore);
7.		manager.removeLockWaitThread(currentThread, this);
8.	}

The bug occurs when the acquire did not succeed so "success=false". There are a few ways this can happen - we reached the specified delay or we were interrupted. Now, *after* line 1 and before entering the method on line 6, someone releases the lock, which notifies the semaphore. We then discard the semaphore on line 6, and the notification is "lost". Now nobody owns the lock but there could be other threads waiting in the queue. If so, those threads are deadlocked forever, and any subsequent attempt to acquire the lock also blocks forever.

The failure case is obviously catastrophic, but very unlikely because it requires hitting a small timing window, *and* having at least three threads involved with the lock at once. However I occasionally see stack dumps where many threads are waiting on an OrderedLock and nobody seems to own it, which is quite possibly caused by this bug.
Comment 15 James Blackburn CLA 2010-05-10 14:52:01 EDT
(In reply to comment #13)
> Created an attachment (id=167776) [details]
> patch 4
> 
> Minor tidying on patch 3. The only changes are copyright updates, and
> refactoring the new synchronized block into a separate method.

Thanks John, looks good to me.
Comment 16 James Blackburn CLA 2010-05-10 15:34:39 EDT
Created attachment 167794 [details]
test 3

Small tidy to test2
 - Make it tear-down cleanly  
      + auxiliary threads unnecessarily assert
     + make main thread check for DONE status on ThreadBarriers at end
 - Fix warnings: no need to call Thread.stop as the test is expected to pass.
Comment 17 James Blackburn CLA 2010-05-10 15:46:14 EDT
Created attachment 167800 [details]
test 4

Even more tidy test. Generalise the Thread instances to a generic TestThread nested class.
I'll stop now...
Comment 18 Thomas Watson CLA 2010-05-10 16:15:32 EDT
+1, this is a great find and fix.  Thanks James!  I think this is important to get in since as John mentions in comment 14, we have seen stack dumps that could be explained by this bug.
Comment 19 John Arthorne CLA 2010-05-10 16:39:58 EDT
Thanks all. Patch 4 and test 4 released. Min, just CCing FYI since I heard you were also chasing a mysterious deadlock.
Comment 20 John Arthorne CLA 2010-05-10 16:41:48 EDT
Note I put iplog+ on patch #3, since my patch #4 is just a tidied version of James' patch #3.
Comment 21 Min Idzelis CLA 2010-05-10 17:00:56 EDT
Thanks for the CC John, I think this might explain a few random/rare issues that I've seen before.
Comment 22 John Arthorne CLA 2010-05-17 11:28:35 EDT
I originally thought this was a long-standing bug, but it looks like this was introduced by the fix for bug 307282. Before that fix, there was no cleanup performed on interrupt so this wouldn't have happened.
Comment 23 John Arthorne CLA 2010-05-17 11:30:57 EDT
*** Bug 308235 has been marked as a duplicate of this bug. ***
Comment 24 John Arthorne CLA 2010-05-17 11:31:04 EDT
*** Bug 312543 has been marked as a duplicate of this bug. ***
Comment 25 Dani Megert CLA 2010-05-17 11:45:44 EDT
*** Bug 313086 has been marked as a duplicate of this bug. ***