| Summary: | Spin loops in JobManager causing lock contention | ||
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Min Idzelis <min123> |
| Component: | Runtime | Assignee: | platform-runtime-inbox <platform-runtime-inbox> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | normal | ||
| Priority: | P3 | CC: | john.arthorne, rstheo |
| Version: | 3.4.1 | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Windows XP | ||
| Whiteboard: | stalebug | ||
|
Description
Min Idzelis
The current code is a hybrid of wait/notify and polling wait. There is a notification when the lock is available (ImplicitJobs.notifyWaitingThreadJobs). It was originally a pure wait/notify system, but ran into problems with re-entrant waits. Essentially while waiting the event loop can run, which can potentially result in a re-entrant wait (either on the same lock or a different one). This made the wait/notify solution very complex and I was never able to make it entirely free of deadlocks. Do you have some evidence that 4 additional lock synchronizations a second produces a non-trivial performance impact? I would like to use this bug to track the patch I'll be creating to remove the time-wait in Thread.joinRun().
So far, my experiments have been successful. The only change I needed to do was to add a notify() within the ImplicitJobs' transfer() method.
synchronized (ThreadJob.notifier) {
ThreadJob.notifier.notifyAll();
}
One problem I see right away is supporting cancellation of beginRule(). Unfortunately, the IProgressMonitor only provides a polling-based cancellation. (isCanceled()) Our goal is to not have the ThreadJob#joinRun() wake up unnecessarily because it acquires locks that causes lock contention. If we don't wake up occasionally, we won't discover that we were canceled.
My proposal is to allow the non-timed wait to support interruption. When interrupted, it will throw an OperationCanceledException. Whenever a thread enters a joinRun, an independent cancellation polling thread will start. (A single thread can handle all thread jobs). When the poll thread detects that the monitor is it monitoring is canceled, it will interrupt() the waiting thread job in joinRun().
This is a trade off. There will be one additional thread running whenever there are threads blocked in beginRule() for the benefit of reduced contention.
John, what do you think of this design?
I have to say I'm very reluctant to change this unless the current behaviour is actually buggy or a severe performance problem. Before switching to a timed wait this code was a constant source of deadlocks/livelocks due to inadequate notification of the waiting monitor. Note there are several exit criteria for this while loop and they all need to be responded to in a timely manner. This includes cancelation, and also the LockListener can decide to grant immediate access (this is used to handle deadlocks between Display.syncExec and waiting rules). I know the current code is not optimal but we have to be mindful that a single case of failing to wake up the sleeping monitor can kill an application. This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug. If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. -- The automated Eclipse Genie. |