| Summary: | Workspace lock not released leading to deadlock | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | James Blackburn <jamesblackburn+eclipse> |
| Component: | Resources | Assignee: | John Arthorne <john.arthorne> |
| Status: | CLOSED DUPLICATE | QA Contact: | |
| Severity: | major | | |
| Priority: | P3 | | |
| Version: | 3.5.1 | | |
| Target Milestone: | 3.6 | | |
| Hardware: | PC | | |
| OS: | Linux-GTK | | |
| Whiteboard: | | | |
| Attachments: | | | |
Created attachment 151954 [details]
metadata log
Hmmm. Just saw this in the metadata log immediately before the deadlock occurs...
It looks like something bad's happening in the UI leading to a JFace runtime error (which may very well be my fault...).
Shouldn't the workspace lock be released on failures such as this, to prevent Eclipse locking up like this?
Created attachment 152020 [details]
build waiting on main
Image showing build waiting on main.
Created attachment 152022 [details]
blocking job is AutoBuild
I had a go at writing some unit tests to provoke this -- replicating the exceptions in the metadata log. I've been unsuccessful in provoking a lockup so far :( Will see if I can reproduce with lock and job tracing on.
Created attachment 152031 [details]
Reproduction 2
Can reproduce it reasonably reliably - though not yet with lock tracing on.
In the attached stack, main is in the normal display loop. However, the WorkManager lock is apparently still owned by it (orderedLock.currentOperationThread()).
Created attachment 152084 [details]
Test 1
After much gnashing of teeth, I cached the stack traces that acquire the workspace lock in OrderedLock and noticed that depth is neither volatile (though this is pretty meaningless in Java 1.4) nor always accessed under synchronization.
I added some assertions that the thread modifying depth was currentThread, and out popped LockManager#addLockWaitThread(...) -> LockState#suspend(...) -> OrderedLock#forceRelease(...)
The attached JUnit is the result of this insight: create a deadlock condition between UI and Resources, and wait until the workspace locks up. Usually 10 - 100 iterations of my outer loop.
This doesn't look thread-safe to me.
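Roughly the shape of the assertion described above (a minimal sketch, not the actual OrderedLock source; field and method names are illustrative):

    // Sketch only: an ownership assertion around every modification of depth.
    // It trips when the deadlock-resolution path (forceRelease) manipulates
    // depth from a thread other than the lock's current owner.
    class OrderedLockSketch {
        private Thread currentOperationThread; // owner of the lock, null when free
        private int depth;                     // re-entrant acquisition count

        private void assertOwner() {
            if (Thread.currentThread() != currentOperationThread)
                throw new AssertionError("depth modified by non-owner thread");
        }

        synchronized void enter() {
            assertOwner();
            depth++;
        }

        synchronized void exit() {
            assertOwner();
            depth--;
        }
    }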
John / Szymon, the JavaDoc on OrderedLock states:
" * The lock avoids circular waiting deadlocks by detecting the deadlocks
* and resolving them through the suspension of all locks owned by one
* of the threads involved in the deadlock. This makes it impossible for n such
* locks to deadlock while waiting for each other..."
Any glimmers of insight?
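For reference, the circular wait that JavaDoc passage refers to looks roughly like this (a minimal sketch with illustrative thread bodies, using the public ILock API rather than the attached test):

    // Two threads acquire the same two ILocks in opposite order, producing the
    // circular wait that OrderedLock is documented to detect and resolve by
    // suspending the locks held by one of the threads.
    // Requires: org.eclipse.core.runtime.jobs.ILock, org.eclipse.core.runtime.jobs.Job
    final ILock lockA = Job.getJobManager().newLock();
    final ILock lockB = Job.getJobManager().newLock();

    Thread t1 = new Thread(new Runnable() {
        public void run() {
            lockA.acquire();         // t1 takes A first
            try {
                lockB.acquire();     // ...then waits for B, held by t2
                lockB.release();
            } finally {
                lockA.release();
            }
        }
    });
    Thread t2 = new Thread(new Runnable() {
        public void run() {
            lockB.acquire();         // t2 takes B first
            try {
                lockA.acquire();     // ...then waits for A -> circular wait
                lockA.release();
            } finally {
                lockB.release();
            }
        }
    });
    t1.start();
    t2.start();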
Crashing patch attached.
Bumping to major as it locks up the IDE without any indication of what went wrong. If deadlock detection is to blame then it's likely API-user error, but at the very least core.resources should fail gracefully.
Created attachment 152085 [details]
Test 1 patch
Completely wrong patch attached previously.
Created attachment 163428 [details]
Updated test
This moves the test into the UI test bundle where it has access to SWT API.
I am unable to reproduce the deadlock with this test, but I suspect this is a duplicate of bug 307282. Our implementation of syncExec calls Thread.interrupt() on the UI thread in order to get a blocked UI thread to process syncExec requests. So, this could easily result in the broken lock state described in bug 307282. However, it would be useful to have a reproducible test to verify this. Note I tried running the test both with and without the fix for bug 307282, but it didn't deadlock in either case. James, can you still reproduce this?

Created attachment 163433 [details]
backtrace

(In reply to comment #10)
> So, this could easily result in the broken lock
> state described in bug 307282.

I think you're right: the manifestation of this bug - orphaned locks - is identical to what I was seeing when the user first encountered the problem. I haven't been able to reproduce this with the test on 3.6 at work either...

However, just running the test here on my Mac (with a clean core.resources & core.jobs) I'm getting a reliable:
"Attempted to beginRule: R/, does not match outer scope rule: L/testProject/foo"

(In reply to comment #11)
> However just running the test here on my mac (with a clean core.resources &
> core.jobs) I'm getting a reliable:
> "Attempted to beginRule: R/, does not match outer scope rule:
> L/testProject/foo"

This is a bug in the test. It attempts to do refreshLocal on the workspace root from within a syncExec. If that syncExec occurs in the UI thread while the UI thread holds a different lock, you'll get this failure. Here is the trace:

    Thread [main] (Suspended (exception IllegalArgumentException))
        owns: ImplicitJobs (id=858)
        owns: RunnableLock (id=859)
        Assert.isLegal(boolean, String) line: 63
        ThreadJob.illegalPush(ISchedulingRule, ISchedulingRule) line: 136
        ThreadJob.push(ISchedulingRule) line: 326
        ImplicitJobs.begin(ISchedulingRule, IProgressMonitor, boolean) line: 63
        JobManager.beginRule(ISchedulingRule, IProgressMonitor) line: 285
        WorkManager.checkIn(ISchedulingRule, IProgressMonitor) line: 117
        Workspace.prepareOperation(ISchedulingRule, IProgressMonitor) line: 1914
        WorkspaceRoot(Resource).refreshLocal(int, IProgressMonitor) line: 1655
        Bug_294894$6.run() line: 156
        UILockListener.doPendingWork() line: 164
        UILockListener.aboutToWait(Thread) line: 126
        LockManager.aboutToWait(Thread) line: 123
        OrderedLock.doAcquire(Semaphore, long) line: 157
        OrderedLock.acquire(long) line: 109
        OrderedLock.acquire() line: 83
        WorkManager.checkIn(ISchedulingRule, IProgressMonitor) line: 118
        Workspace.prepareOperation(ISchedulingRule, IProgressMonitor) line: 1914
        File.appendContents(InputStream, int, IProgressMonitor) line: 47
        Bug_294894.testDeadLockFail() line: 191

(In reply to comment #12)
> (In reply to comment #11)
> > However just running the test here on my mac (with a clean core.resources &
> > core.jobs) I'm getting a reliable:
> > "Attempted to beginRule: R/, does not match outer scope rule:
> > L/testProject/foo"
>
> This is a bug in the test. It attempts to do refreshLocal on workspace root
> from within a syncExec. If that syncExec occurs in the UI thread while the UI
> thread holds a different lock, you'll get this failure. Here is the trace:

Right, that's interesting. Presumably this can happen anytime:

UI: blocked while holding a resource rule
Job: does a syncExec which tries to grab a conflicting / non-contained scheduling rule

even though the API user believes they're acquiring locks in the same order -- i.e. UI lock first.
This seems counter-intuitive to me -- shouldn't the original scheduling rule be released while doing pending work?

(In reply to comment #13)
> even though the api user believes they're acquiring locks in the same order --
> i.e. UI lock first. This seems counter-intuitive to me, shouldn't the original
> scheduling rule be release while doing pending work?

What we do now is effectively "transfer" the lock into the syncExec from the caller. So, the syncExec can modify the same things that the caller of the syncExec could modify. I.e., it's as if the body of the syncExec was occurring in the thread that invoked the syncExec. I don't think releasing the lock while calling syncExec is a safe solution - it would mean that any other thread could obtain that scheduling rule and therefore violate the thread safety that scheduling rules are intended to provide.

Created attachment 163463 [details]
test take 3

(In reply to comment #14)
> What we do now is effectively "transfer" the lock into the syncExec from the
> caller. So, the syncExec can modify the same things that the caller of the
> syncExec could modify. I.e., it's as if the body of the syncExec was occurring
> in the thread that invoked the syncExec.

AFAICS that's not what's happening here -- the confusion might be that there are two separate jobs being scheduled - both called j *ugh*. I've renamed the 'inner' job j2, and corrected the indentation.

main: File scheduling rule acquired to modify the file, attempting the WorkManager ILock

    at org.eclipse.core.internal.resources.File.appendContents(File.java:47)
    at org.eclipse.ui.tests.concurrency.Bug_294894.testDeadLockFail(Bug_294894.java:190)

Job j2: not run with any scheduling rules, attempts a syncExec.

main: before waiting for the WorkManager lock, processes pendingWork:

    at org.eclipse.core.internal.resources.Resource.refreshLocal(Resource.java:1655)
    at org.eclipse.ui.tests.concurrency.Bug_294894$6.run(Bug_294894.java:155)
    at org.eclipse.ui.internal.UILockListener.doPendingWork(UILockListener.java:164)
    at org.eclipse.ui.internal.UILockListener.aboutToWait(UILockListener.java:126)
    at org.eclipse.core.internal.jobs.LockManager.aboutToWait(LockManager.java:123)

Job: j2 in the syncExec runs, attempts a resource-changing operation

    at org.eclipse.core.internal.resources.Resource.refreshLocal(Resource.java:1655) => syncExec
    at org.eclipse.ui.tests.concurrency.Bug_294894$6.run(Bug_294894.java:155)

The syncExec / Job does not itself hold any scheduling rules. The UI thread had already acquired the file scheduling rule when it went into doPendingWork. Hence the subsequent failure.

The problem I see is that this code:

    Job j2 = new Job("Workspace refresh") {
        protected IStatus run(IProgressMonitor monitor) {
            Display.getDefault().syncExec(new Runnable() {
                public void run() {
                    try {
                        workspace.getRoot().refreshLocal(IResource.DEPTH_INFINITE, null);
                    } catch (Exception e) {
                        fail("refresh run", e);
                    }
                }
            });
            return Status.OK_STATUS;
        }
    };
    j2.schedule();

doesn't have any control over (or knowledge of) what scheduling rules the UI thread already owns.

> So, the syncExec can modify the same things that the caller of the
> syncExec could modify. I.e., it's as if the body of the syncExec was occurring
> in the thread that invoked the syncExec.

This would have worked correctly without the syncExec, as the root rule acquire would have just blocked. The problem seems to be that the syncExec holds the scheduling rule (and locks?) of whatever's blocked and processing pendingWork.
I suspect it wouldn't be too hard to write a test that demonstrated this (without the massive loop), perhaps:
- attach a resource change listener
- Workspace.run(null_rule) { modify file }
- the resource change listener should block (workspace lock held), no scheduling rules
- attempt a resource change operation on the UI thread
- {a}syncExec in a job which contains another scheduling rule acquire -- either a non-contained resource rule, or a non-resource rule

As for the original issue, I've been unable to reproduce it in 3.6, so I'd be happy to close this bug and possibly file another if you agree the underlying behaviour of scheduling rules in the UI thread is suspect.

Closing the original problem as a duplicate. I don't see a problem with the current interaction of syncExec and locks, but you're welcome to enter a bug report if you think you see a problem. Keep in mind that the whole system is designed to handle the trivial deadlock case where thread A owns a lock and performs a syncExec, and thread B is the UI thread waiting for the lock held by A. There are certainly more complex lock interactions that we cannot safely resolve automatically, and it's up to the client code to avoid unsafe practices (like attempting to acquire arbitrary locks in a syncExec where you don't know the lock state of the thread you are going to run in).

*** This bug has been marked as a duplicate of bug 307282 ***
Created attachment 151953 [details]
backtrace

A user has been getting a lot of these... The AutoBuildJob is deadlocking with another thread.
- AutoBuildJob has acquired the workspace scheduling rule && is trying to acquire the WorkManager lock
- Another thread, in this case main, is trying to acquire a resource scheduling rule, but the main thread is apparently the thread holding the workspace lock...

According to the debugger, the WorkManager lock is currently held by the main thread. I can't see how that can be the case, as the main thread isn't doing any work checked into the workspace. Backtraces and screenshots attached.