| Summary: | OrderedLock: interrupting during acquire can lead to deadlock | ||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | James Blackburn <jamesblackburn+eclipse> | ||||||||||||||||||||||||
| Component: | Runtime | Assignee: | John Arthorne <john.arthorne> | ||||||||||||||||||||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||||||||||||||||||||
| Severity: | critical | ||||||||||||||||||||||||||
| Priority: | P3 | CC: | benno.baumgartner, min123, remy.suen, steffen.pingel, tjwatson | ||||||||||||||||||||||||
| Version: | 3.6 | Flags: | tjwatson:
review+
|
||||||||||||||||||||||||
| Target Milestone: | 3.6 RC1 | ||||||||||||||||||||||||||
| Hardware: | PC | ||||||||||||||||||||||||||
| OS: | Linux-GTK | ||||||||||||||||||||||||||
| Whiteboard: | |||||||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||||||
|
Description
James Blackburn
I should add my OrderedLock has my fix for bug 307282 (which I believe to be valid) - otherwise there doesn't seem to be any relevant changes compared with HEAD. Have moved o.e.core.jobs to HEAD in case it happens again. Created attachment 167291 [details]
back-trace 2
Having move to HEAD I've reproduced it. Same symptoms as before. workspace lock isn't held by anyone and 4 threads waiting.
It's not (obviously) Deadlock related. Turning on org.eclipse.core.jobs/jobs/errorondeadlock=true (which I've verified in the debugger), the deadlock IllegalStateException is never hit. Created attachment 167307 [details]
Log with OrderedLock#DEBUG = true
Created attachment 167412 [details] test Got it! Another bug in OrderedLock... I have some normal, non-deadlocking code (CDT), that's doing this with the UI very very reliably - one or two runs of this failing debug session causes the issue. Added some more tracing: [Thread[main,6,main]] Operation waiting to be executed... OrderedLock (0) [Thread[main,6,main]] doAcquire() enter OrderedLock (0) [Thread[Worker-0,5,main]] doRelease() @ depth 0 OrderedLock (0) [Thread[Worker-0,5,main]] setCurrentOperationThread() null OrderedLock (0) [Thread[Worker-0,5,main]] release() @ depth 0 OrderedLock (0) [Thread[Worker-18,5,main]] doAcquire() success=true OrderedLock (0) [Thread[Worker-18,5,main]] setCurrentOperationThread() Thread[Worker-18,5,main] OrderedLock (0) [Thread[Worker-18,5,main]] Operation started... OrderedLock (0) [Thread[Worker-18,5,main]] doRelease() @ depth 0 OrderedLock (0) [Thread[Worker-18,5,main]] setCurrentOperationThread() null OrderedLock (0) [Thread[Worker-18,5,main]] release() @ depth 0 OrderedLock (0) [Thread[Worker-0,5,main]] Operation waiting to be executed... OrderedLock (0) [Thread[Worker-0,5,main]] doAcquire() enter OrderedLock (0) [Thread[main,6,main]] Operation interrupted while waiting... :-| OrderedLock (0) What you can see in the last two lines is this: The main thread has been interrupted *just* as the Semaphore is released(). There are 2 races here: 1) The semaphore can be acquired (notification++) just before the thread is interrupted. i.e. you can get an ordering of: sem.notification++; interrupt(); notify() -> not a successful acquire(). 2) Semaphore interrupted(); sem.notification++ In both cases the semaphore is discarded as a result of the interruption, so the notification is lost. Now this is nearly always ok. The next thread trying to acquire the lock succeeds -- and this might be the same thread called from #acquire(). However if there's more than one thread in contention, then the new thread will just join the queue. The lock is then lost forever. Test attached. It's unfortunately a data race internal to OrderedLock, so it's a pretty long test... The safe solution could be one of two: - check whether the Sempaphore can be acquire(0) before discarding it - don't throw away the semaphore Some sanity checks could be added to OrderedLock when a thread is queued that currentThread == null || operations.empty() -- though this may be difficult as doAcquire() isn't synchronized. John I think this was the original source of the bug 294894. Lots of interrupting happens when the UI thread tries to acquire locks, and, afaics, it always goes through the acquire() code path which would suggest it's not bug 307282. Created attachment 167415 [details]
test 2
Much more aggressive test. TestBarrier takes too long to provoke the race we're trying to catch.
This test has the 3 threads fighting for the lock, with the main thread interrupting them. It fails for me reliably in the first iteration.
Comment out the intterupt()s to make the test pass...
Created attachment 167421 [details]
patch 1
Potential patch for the issue.
doAcquire() isn't synchronized (it calls external code in the DeadlockDetector, etc.). However we should, as part of a synchronized block, test whether the semaphore has been released before discarding it.
This patch does a check of the semaphore followed by a removal from the operations queue if unsuccessful, in a single sychronized block. This needs to be in one sync block to prevent the 2 issues discussed previously. As the #doRelease() uses the same lock, this prevents concurrent notification and stops us from missing a lock release.
Bumping to critical. This issue makes it very easy for the UI to break runtime locks. I have only had a chance to look at this briefy, but the fix isn't right. We must call manager.aboutToWait when in a loop waiting for a lock. This allows us to resolve deadlocks between an OrderedLock and the UI thread (lock owner does syncExec while UI is waiting for that lock). There must be a simpler fix here but I need to look at it more. The patch doesn't change that the flow. All it does is acquire(_0_) before the semaphore dequeue to check that it hasn't been notified concurrently immediately before. The reason for the while loop is that acquire(...) can throw interruptexception. The intention is not tospend any time doing this, just catch the race. I agree it's not the prettiest fix. But afaics the only other alternative is to modify semaphore. Created attachment 167587 [details] patch 2 This is (I believe) an equivalent patch, without the ugly while loop. Semaphore.attempt() method added to attempt to acquire the semaphore without waiting. The key fix in this (and patch 1) is that the semaphore is checked and dequeued while synchronized on the OrderedLock monitor. This is the same monitor used in doRelease(). I think the fix for bug 307282 made this issue worse for callers of acquire(). Previously the semaphore wasn't discarded, now it always is, so the bug is as easily visible here as it is with acquire(int). Created attachment 167598 [details]
patch 3
Even tidier patch, it looks like OrderedLock monitor is acquired before the DeadlockDetector. So it's ok to put the whole doAcquire() exit in the synchronized.
Created attachment 167776 [details]
patch 4
Minor tidying on patch 3. The only changes are copyright updates, and refactoring the new synchronized block into a separate method.
I tried various other approaches but it amounts to the same thing. We need to atomically consider the semaphore notification state before discarding it. Since we can't wait while holding the object local on OrderedLock, the only choice is to check the semaphore state again from within a synchronized block.
Tom, another bug for review in jobs bundle.
In OrderedLock#doAcquire we had this code (slightly simplified):
1. success = semaphore.acquire(delay);
2. if (success) {
3. depth++;
4. updateCurrentOperation();
5. } else {
6. removeFromQueue(semaphore);
7. manager.removeLockWaitThread(currentThread, this);
8. }
The bug occurs when the acquire did not succeed so "success=false". There are a few ways this can happen - we reached the specified delay or we were interrupted. Now, *after* line 1 and before entering the method on line 6, someone releases the lock, which notifies the semaphore. We then discard the semaphore on line 6, and the notification is "lost". Now nobody owns the lock but there could be other threads waiting in the queue. If so, those threads are deadlocked forever, and any subsequent attempt to acquire the lock also blocks forever.
The failure case is obviously catastrophic, but very unlikely because it requires hitting a small timing window, *and* having at least three threads involved with the lock at once. However I occasionally see stack dumps where many threads are waiting on an OrderedLock and nobody seems to own it, which is quite possibly caused by this bug.
(In reply to comment #13) > Created an attachment (id=167776) [details] > patch 4 > > Minor tidying on patch 3. The only changes are copyright updates, and > refactoring the new synchronized block into a separate method. Thanks John, looks good to me. Created attachment 167794 [details]
test 3
Small tidy to test2
- Make it tear-down cleanly
+ auxiliary threads unnecessarily assert
+ make main thread check for DONE status on ThreadBarriers at end
- Fix warnings: no need to call Thread.stop as the test is expected to pass.
Created attachment 167800 [details]
test 4
Even more tidy test. Generalise the Thread instances to a generic TestThread nested class.
I'll stop now...
+1, this is a great find and fix. Thanks James! I think this is important to get in since as John mentions in comment 14, we have seen stack dumps that could be explained by this bug. Thanks all. Patch 4 and test 4 released. Min, just CCing FYI since I heard you were also chasing a mysterious deadlock. Note I put iplog+ on patch #3, since my patch #4 is just a tidied version of James' patch #3. Thanks for the CC John, I think this might explain a few random/rare issues that I've seen before. I originally thought this was a long-standing bug, but it looks like this was introduced by the fix for bug 307282. Before that fix, there was no cleanup performed on interrupt so this wouldn't have happened. *** Bug 308235 has been marked as a duplicate of this bug. *** *** Bug 312543 has been marked as a duplicate of this bug. *** *** Bug 313086 has been marked as a duplicate of this bug. *** |