| Summary: | Deadlock in JobManager.yieldRule | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | John Arthorne <john.arthorne> | ||||
| Component: | Runtime | Assignee: | John Arthorne <john.arthorne> | ||||
| Status: | RESOLVED FIXED | QA Contact: | |||||
| Severity: | normal | ||||||
| Priority: | P3 | CC: | min123 | ||||
| Version: | 3.6 | ||||||
| Target Milestone: | 3.6 M4 | ||||||
| Hardware: | PC | ||||||
| OS: | Windows XP | ||||||
| Whiteboard: | |||||||
| Bug Depends on: | 283449 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
|
||||||
|
Description
John Arthorne
Created attachment 152797 [details]
Fix
This fix checks that the blocking job is still running after entering the synchronized block and before falling asleep. This way if the blocking job terminates immediately prior to the sync block we will just loop and repeat the search for a blocking job.
Also, I think the wait() inside the yield implementation should be a timed wait. I.e., wait a reasonable amount of time for the unblocked job to run (maybe 1s), but don't wait forever. it's possible another job could slip in before the "unblocked" job gets a chance to run. We don't guarantee that a yield will block until at least one job completes, just that we will give another job a chance to run. To avoid this wait from setting up a deadlock condition I think we should put a limit on the amount of time we wait for the "unblocked" job to start running.
I have released the deadlock fix to HEAD. After some more testing I have also released the switch to a timed wait in HEAD. 99.9% of the time this timeout will never come into play, but it prevents rare cases of three or more threads creating a deadlock due to circular wait. In this case we have the rare luxury of being able to timeout on a wait without breaking the contract of the method. (In reply to comment #1) > Also, I think the wait() inside the yield implementation should be a timed > wait. I.e., wait a reasonable amount of time for the unblocked job to run > (maybe 1s), but don't wait forever. it's possible another job could slip in > before the "unblocked" job gets a chance to run. We don't guarantee that a > yield will block until at least one job completes, just that we will give > another job a chance to run. To avoid this wait from setting up a deadlock > condition I think we should put a limit on the amount of time we wait for the > "unblocked" job to start running. Would you mind explaining what you are attempting to prevent? I don't think the timed wait makes a difference at all. I'll explain. First, to clarify your statement, if yield() returns true, it means that not just any job - but a blocked job - was found and given a chance to execute. This means that it was either canceled or it actually ran. Unrelated jobs can be doing anything they want during this time. It's perfectly valid for other jobs can "slip in" before the unblocked job runs - even (very rare) other jobs that could have been considered blocked - but weren't (since they could have had the same rule as the blocked job, but were scheduled after the yield started). I don't see how not having a timed-wait will allow a deadlock to occur. There is no "hold-and-wait" condition, the only lock involved is Job.jobStateLock. The condition (job state) is checked under this lock, and can only be altered while the lock is held. This means that this wait must eventually be notified since all jobs must transition from wait to none or running. (Could it go back into blocked state? If so, this would be a good path to explore. Perhaps it should wait until it is not waiting AND not blocked?) |