Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 294894

Summary: Workspace lock not released leading to deadlock
Product: [Eclipse Project] Platform Reporter: James Blackburn <jamesblackburn+eclipse>
Component: ResourcesAssignee: John Arthorne <john.arthorne>
Status: CLOSED DUPLICATE QA Contact:
Severity: major    
Priority: P3    
Version: 3.5.1   
Target Milestone: 3.6   
Hardware: PC   
OS: Linux-GTK   
Whiteboard:
Attachments:
Description Flags
backtrace
none
metadata log
none
build waiting on main
none
blocking job is AutoBuild
none
Reproduction 2
none
Test 1
none
Test 1 patch
none
Updated test
none
backtrace
none
test take 3 none

Description James Blackburn CLA 2009-11-11 12:54:50 EST
Created attachment 151953 [details]
backtrace

A user has been getting a lot of these...

The AutoBuildJob is deadlocking with another Thread. 
 -  AutoBuildJob has acquired the workspace sheduling rule
    && is trying to acquire the WorkManager lock
 - Another thread, in this case main, is trying to acquire a resource
scheduling rule, but the main thread is apparently the thread holding the
workspace lock...

According to the debugger, the WorkManager lock is currently held by the main
thread. I can't see how that can't be the case, as the main thread isn't doing
any work checked into the workspace.

Backtraces and screenshots attached.
Comment 1 James Blackburn CLA 2009-11-11 13:01:34 EST
Created attachment 151954 [details]
metadata log

Hmmm.  Just saw this in the metadata log immediately before the deadlock occurs...

It looks like something bad's happening in the UI leading to a JFace runtime error (which may very well be my fault...). 

Shouldn't the workspace lock be released to on failures such as this to prevent Eclipse locking up like this?
Comment 2 James Blackburn CLA 2009-11-12 04:17:02 EST
Created attachment 152020 [details]
build waiting on main

Image showing build waiting on main.
Comment 3 James Blackburn CLA 2009-11-12 04:18:31 EST
Created attachment 152022 [details]
blocking job is AutoBuild
Comment 4 James Blackburn CLA 2009-11-12 04:40:48 EST
I had a go writing some unit tests to provoke this -- replicating the exceptions in the metadata log.  I've been unsuccessful in provoking a lockup so far :(

Will see if I can reproduce with locks and job tracing on.
Comment 5 James Blackburn CLA 2009-11-12 08:02:59 EST
Created attachment 152031 [details]
Reproduction 2

Can reproduce it reasonably reliably - though not yet with lock tracing on.  

In the attached stack, main is in the normal display loop. However workmanager lock is apparently still owned by it (orderedLoc.currentOperationThread()).
Comment 6 James Blackburn CLA 2009-11-12 13:24:12 EST
Created attachment 152084 [details]
Test 1

After much gnashing of teeth. I cached the stack traces acquire the workspace lock in Ordered Lock and noticed that depth is neither volatile (though this is pretty meaningless in Java 1.4) nor is it always synchronised.

I added some assertions that the thread modifying depth was currentThread, and out popped LockManager#addLocakWaitThread(...) -> LockState#suspend(...) -> OrderedLock#forceRelease(...)

The attached JUnit is the result of this insight: create a deadlock condition between UI and Resources, and wait until the workspace locks up. Usually 10 - 100 iterations of my outer loop.

This doesn't look thread-safe to me.

John / Szymon the JavaDoc on ordered lock states:
" * The lock avoids circular waiting deadlocks by detecting the deadlocks
 * and resolving them through the suspension of all locks owned by one 
 * of the threads involved in the deadlock. This makes it impossible for n such 
 * locks to deadlock while waiting for each other..."

Any glimmers of insight?

Crashing patch attached.
Comment 7 James Blackburn CLA 2009-11-12 13:25:30 EST
Bumping to major as it locks up the IDE without any indication of what went wrong.

If it is deadlock detection to blame then it's likely API-user error, but at the very worst core.resources should fail gracefully.
Comment 8 James Blackburn CLA 2009-11-12 13:26:32 EST
Created attachment 152085 [details]
Test 1 patch

Completely wrong patch attached previously.
Comment 9 John Arthorne CLA 2010-03-30 13:22:23 EDT
Created attachment 163428 [details]
Updated test

This moves the test into the UI test bundle where it has access to SWT API.
Comment 10 John Arthorne CLA 2010-03-30 13:24:51 EDT
I am unable to reproduce the deadlock with this test, but I suspect this is a duplicate of bug 307282. Our implementation of syncExec calls Thread.interrupt() on the UI thread in order to get a blocked UI thread to process syncExec requests. So, this could easily result in the broken lock state described in bug 307282. However, it would be useful to have a reproducible test to verify this. Note I tried running the test both with and without the fix for bug 307282 but it didn't deadlock in either case. James, can you still reproduce this?
Comment 11 James Blackburn CLA 2010-03-30 14:07:52 EDT
Created attachment 163433 [details]
backtrace

(In reply to comment #10)
> So, this could easily result in the broken lock
> state described in bug 307282. 

I think you're right, the manifestation of this bug - orphaned locks - is identical to what I was seeing when the user first encountered the problem. I haven't been able to reproduce this test and 3.6 at work either...  

However just running the test here on my mac (with a clean core.resources & core.jobs) I'm getting a reliable:
"Attempted to beginRule: R/, does not match outer scope rule: L/testProject/foo"
Comment 12 John Arthorne CLA 2010-03-30 14:29:27 EDT
(In reply to comment #11)
> However just running the test here on my mac (with a clean core.resources &
> core.jobs) I'm getting a reliable:
> "Attempted to beginRule: R/, does not match outer scope rule:
> L/testProject/foo"

This is a bug in the test. It attempts to do refreshLocal on workspace root from within a syncExec. If that syncExec occurs in the UI thread while the UI thread holds a different lock, you'll get this failure. Here is the trace:

Thread [main] (Suspended (exception IllegalArgumentException))	
	owns: ImplicitJobs  (id=858)	
	owns: RunnableLock  (id=859)	
	Assert.isLegal(boolean, String) line: 63	
	ThreadJob.illegalPush(ISchedulingRule, ISchedulingRule) line: 136	
	ThreadJob.push(ISchedulingRule) line: 326	
	ImplicitJobs.begin(ISchedulingRule, IProgressMonitor, boolean) line: 63	
	JobManager.beginRule(ISchedulingRule, IProgressMonitor) line: 285	
	WorkManager.checkIn(ISchedulingRule, IProgressMonitor) line: 117	
	Workspace.prepareOperation(ISchedulingRule, IProgressMonitor) line: 1914	
	WorkspaceRoot(Resource).refreshLocal(int, IProgressMonitor) line: 1655	
	Bug_294894$6.run() line: 156	
	UILockListener.doPendingWork() line: 164	
	UILockListener.aboutToWait(Thread) line: 126	
	LockManager.aboutToWait(Thread) line: 123	
	OrderedLock.doAcquire(Semaphore, long) line: 157	
	OrderedLock.acquire(long) line: 109	
	OrderedLock.acquire() line: 83	
	WorkManager.checkIn(ISchedulingRule, IProgressMonitor) line: 118	
	Workspace.prepareOperation(ISchedulingRule, IProgressMonitor) line: 1914	
	File.appendContents(InputStream, int, IProgressMonitor) line: 47	
	Bug_294894.testDeadLockFail() line: 191
Comment 13 James Blackburn CLA 2010-03-30 14:42:13 EDT
(In reply to comment #12)
> (In reply to comment #11)
> > However just running the test here on my mac (with a clean core.resources &
> > core.jobs) I'm getting a reliable:
> > "Attempted to beginRule: R/, does not match outer scope rule:
> > L/testProject/foo"
> 
> This is a bug in the test. It attempts to do refreshLocal on workspace root
> from within a syncExec. If that syncExec occurs in the UI thread while the UI
> thread holds a different lock, you'll get this failure. Here is the trace:

Right, that's interesting. Presumably this can happen anytime:
 UI: blocked while holding a resource rule
 Job: does syncExec which tries to grab a conflicting / non-contained  scheduling rule.

even though the api user believes they're acquiring locks in the same order -- i.e. UI lock first. This seems counter-intuitive to me, shouldn't the original scheduling rule be release while doing pending work?
Comment 14 John Arthorne CLA 2010-03-30 17:10:02 EDT
(In reply to comment #13)
> even though the api user believes they're acquiring locks in the same order --
> i.e. UI lock first. This seems counter-intuitive to me, shouldn't the original
> scheduling rule be release while doing pending work?

What we do now is effectively "transfer" the lock into the syncExec from the caller. So, the syncExec can modify the same things that the caller of the syncExec could modify. I.e., it's as if the body of the syncExec was occurring in the thread that invoked the syncExec.

I don't think releasing the lock while calling syncExec is a safe solution - it would mean that any other thread could obtain that scheduling rule and therefore violate the thread safety that scheduling rules are intended to provide.
Comment 15 James Blackburn CLA 2010-03-30 19:33:21 EDT
Created attachment 163463 [details]
test take 3

(In reply to comment #14)
> What we do now is effectively "transfer" the lock into the syncExec from the
> caller. So, the syncExec can modify the same things that the caller of the
> syncExec could modify. I.e., it's as if the body of the syncExec was occurring
> in the thread that invoked the syncExec.

AFAICS that's not what's happening here -- the confusion might be that there are two separate jobs being scheduled - both called j  *ugh*. I've renamed the 'inner' job j2, and corrected indentation.

main:
        File scheduling Rule acquired to modify file, attempting Workmanager ILock
  	at org.eclipse.core.internal.resources.File.appendContents(File.java:47)
	at org.eclipse.ui.tests.concurrency.Bug_294894.testDeadLockFail(Bug_294894.java:190)

Job j2: 
        Not run with any scheduling rules, attempts sync exec.

main: before waiting for the Workmanager lock processes 'pendingWork.
	at org.eclipse.core.internal.resources.Resource.refreshLocal(Resource.java:1655)
	at org.eclipse.ui.tests.concurrency.Bug_294894$6.run(Bug_294894.java:155)
	at org.eclipse.ui.internal.UILockListener.doPendingWork(UILockListener.java:164)
	at org.eclipse.ui.internal.UILockListener.aboutToWait(UILockListener.java:126)
	at org.eclipse.core.internal.jobs.LockManager.aboutToWait(LockManager.java:123)

Job: 
        j2 in the sync exec runs, attempts resource changing operation
	at org.eclipse.core.internal.resources.Resource.refreshLocal(Resource.java:1655)
  => syncExec
	at org.eclipse.ui.tests.concurrency.Bug_294894$6.run(Bug_294894.java:155)

The syncExec / Job does not itself hold any IScheduling rules. The UI thread has already acquire the file sched rule when it went into the doPendingWork. Hence the subsequent failure.  The problem I see is that this code:
Job j2 = new Job("Workspace refresh") {
	protected IStatus run(IProgressMonitor monitor) {
		Display.getDefault().syncExec(new Runnable() {
			public void run() {
				try {
					workspace.getRoot().refreshLocal(IResource.DEPTH_INFINITE, null);
				} catch (Exception e) {
					fail("refresh run", e);
				}
			}
		});
		return Status.OK_STATUS;
	}
};
j2.schedule();

doesn't have any control/knowledge over what scheduling rules the UI thread already owns.

> So, the syncExec can modify the same things that the caller of the
> syncExec could modify. I.e., it's as if the body of the syncExec was occurring
> in the thread that invoked the syncExec.

This would have worked correctly without the syncExec as the the root rule acquire would have just blocked.  The problem seems to be that the syncExec holds the scheduling rule (and locks?) of whatever's blocked and processing pendingWork.

I suspect it wouldn't be too hard to write a test that demonstrated this (without the massive loop), perhaps:
  - attach a resource change listener
  - Workspace.run( null_rule) { modify file }
  - resource change listener should block (workspace lock held), no scheduling rules
  - attempt a resource change operation on the UI thread
  - {a}sync exec in a job which contains another scheduling rule acquire -- either non-contained resource rule, or a non-resource rule.
Comment 16 James Blackburn CLA 2010-03-30 19:35:05 EDT
As for the original issue, I've been unable to reproduce it in 3.6, so I'd happy to close this bug and possibly file another if you agree the underlying behaviour of sched rules in the UI thread is suspect.
Comment 17 John Arthorne CLA 2010-04-06 14:35:11 EDT
Closing the original problem as a duplicate.

I don't see a problem with the current interaction of syncExec and locks but you're welcome to enter a bug report if you think you see a problem. Keep in mind that whole system is designed to handle the trivial deadlock case where thread A owns a lock and performs syncExec, and Thread B is the UI thread waiting for the lock held by A. There are certainly more complex lock interactions that we cannot safely resolve automatically, and it's up to the client code to avoid unsafe practices (like attempting to acquire arbitrary locks in a syncExec where you don't know the lock state of the thread you are going to run in).

*** This bug has been marked as a duplicate of bug 307282 ***