Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 295796

Summary: Deadlock in JobManager.yieldRule
Product: [Eclipse Project] Platform Reporter: John Arthorne <john.arthorne>
Component: RuntimeAssignee: John Arthorne <john.arthorne>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: P3 CC: min123
Version: 3.6   
Target Milestone: 3.6 M4   
Hardware: PC   
OS: Windows XP   
Whiteboard:
Bug Depends on: 283449    
Bug Blocks:    
Attachments:
Description Flags
Fix none

Description John Arthorne CLA 2009-11-21 23:41:41 EST
After investigating the transient failures in bug 295793 I ran YieldTest 10,000 times and encountered the following deadlock while running YieldTest.testYieldJobToThread:

Waiting thread is in ThreadJob.waitForRun inside:

synchronized (blockingJob.jobStateLock) {
	try {
		blockingJob.jobStateLock.wait();
	} catch (InterruptedException e) {
		interrupted = true;
		// Must break here to ensure aboutToWait() is called outside the lock.
		break;
	}
}

Yield job is here:

synchronized (unblocked.jobStateLock) {
	while (((ThreadJob) unblocked).isWaiting) {
		try {
			unblocked.jobStateLock.wait();
		} catch (InterruptedException e) {
			interrupted = true;
		}
	}
}

if the yielding job's state changes from RUNNING > WAITING before ThreadJob.jobStateJob enters the synchronized block, it will wait forever.  
Now since "unblocked" is stuck, the yielding job will also wait forever.

The wait in ThreadJob.joinRun doesn't follow the general wait pattern that the wait exit condition needs to be checked after entering the synchronized block and before falling asleep on the wait. Otherwise the wait condition could be satisfied before the synchronized block started and it waits forever.
Comment 1 John Arthorne CLA 2009-11-21 23:47:29 EST
Created attachment 152797 [details]
Fix

This fix checks that the blocking job is still running after entering the synchronized block and before falling asleep. This way if the blocking job terminates immediately prior to the sync block we will just loop and repeat the search for a blocking job.

Also, I think the wait() inside the yield implementation should be a timed wait. I.e., wait a reasonable amount of time for the unblocked job to run (maybe 1s), but don't wait forever. it's possible another job could slip in before the "unblocked" job gets a chance to run. We don't guarantee that a yield will block until at least one job completes, just that we will give another job a chance to run. To avoid this wait from setting up a deadlock condition I think we should put a limit on the amount of time we wait for the "unblocked" job to start running.
Comment 2 John Arthorne CLA 2009-11-21 23:57:07 EST
I have released the deadlock fix to HEAD.
Comment 3 John Arthorne CLA 2009-11-22 00:22:39 EST
After some more testing I have also released the switch to a timed wait in HEAD. 99.9% of the time this timeout will never come into play, but it prevents rare cases of three or more threads creating a deadlock due to circular wait. In this case we have the rare luxury of being able to timeout on a wait without breaking the contract of the method.
Comment 4 Min Idzelis CLA 2009-11-24 21:39:20 EST
(In reply to comment #1)

> Also, I think the wait() inside the yield implementation should be a timed
> wait. I.e., wait a reasonable amount of time for the unblocked job to run
> (maybe 1s), but don't wait forever. it's possible another job could slip in
> before the "unblocked" job gets a chance to run. We don't guarantee that a
> yield will block until at least one job completes, just that we will give
> another job a chance to run. To avoid this wait from setting up a deadlock
> condition I think we should put a limit on the amount of time we wait for the
> "unblocked" job to start running.

Would you mind explaining what you are attempting to prevent? I don't think the timed wait makes a difference at all. I'll explain. 

First, to clarify your statement, if yield() returns true, it means that not just any job - but a blocked job - was found and given a chance to execute. This means that it was either canceled or it actually ran. Unrelated jobs can be doing anything they want during this time. It's perfectly valid for other jobs can "slip in" before the unblocked job runs - even (very rare) other jobs that could have been considered blocked - but weren't (since they could have had the same rule as the blocked job, but were scheduled after the yield started). 

I don't see how not having a timed-wait will allow a deadlock to occur. There is no "hold-and-wait" condition, the only lock involved is Job.jobStateLock. The condition (job state) is checked under this lock, and can only be altered while the lock is held. This means that this wait must eventually be notified since all jobs must transition from wait to none or running. (Could it go back into blocked state? If so, this would be a good path to explore. Perhaps it should wait until it is not waiting AND not blocked?)