| Summary: | deadlock: a simple scenario | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Yasser Lulu <ylulu> | ||||
| Component: | Runtime | Assignee: | John Arthorne <john.arthorne> | ||||
| Status: | RESOLVED INVALID | QA Contact: | |||||
| Severity: | normal | ||||||
| Priority: | P3 | CC: | dirk_baeumer, dmisic, markus.kell.r, Tod_Creasey, zina | ||||
| Version: | 3.1 | ||||||
| Target Milestone: | --- | ||||||
| Hardware: | PC | ||||||
| OS: | Windows XP | ||||||
| Whiteboard: | |||||||
| Bug Depends on: | |||||||
| Bug Blocks: | 108162 | ||||||
| Attachments: |
|
||||||
This deadlocks in 3.1? I thought we handled this case... I just looked at NestedSyncExecDeadlockTest in our test suites and we handle some subtle variants of this. I'll investigate... This is the details of the Eclipse I'm using: Eclipse SDK Version: 3.1.0 Build id: I20050617-1618 The JVM I'm using is: Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_08-b03) From my first glance at this bug, I was confusing it with the case of: 1) non-UI thread gets workspace lock 2) non-UI thread does a syncExec 3) workspace lock is acquired in the sync exec. This is the case that we handle currently. In your situation, it is a more general case of control being transferred to another thread. We do have a mechanism to make this work, and support for it is built into ModalContext. If the operation passed to the ModalContext implements the interface org.eclipse.jface.operation.IThreadListener, it will be notified when control is about to pass to another thread. This is an opportunity to transfer the scheduling rule owned to the modal context thread. See bug 80648. Does this suite your requirements? There is "standardized arbitration" between core locks (ILock, ISchedulingRule), and the UI thread (sync exec essentially locks the UI thread). However, in this case the other lock is a plain Java thread being created and the caller blocks on its completion (Thread.join()). There really isn't any way for the platform to automatically arbitrate lock contention in these situations. The actual deadlock that was occurring in the user scenario was more complex: 1) Refactoring starts in the UI thread. Workspace lock is taken, and a modal context thread is forked (call this modal context MC1). 2) Refactoring transfers the workspace lock to MC1 using IJobManager.transferRule 3) Refactoring operation runs in MC1, causing various side-effects. One side-effect is another plugin schedules an asyncExec. 4) MC1 passes ownership of workspace lock back to UI thread, and exits 5) After passing control back to the UI thread, but before MC1 dies, the asyncExec is run. 6) The asyncExec forks another model context (MC2), and blocks the UI thread in another event loop. 7) MC2 tries to acquire the workspace lock and deadlocks because the workspace lock is currently held by the UI Thread. There are two problematic pieces of code that lead to the deadlock. The first is the refactoring infrastructure, which runs the event loop at a time when it owns the workspace lock. This means that any asyncExec running at that moment that tries to get the same lock will deadlock. The second piece of code is the asyncExec itself, which is indirectly acquiring the workspace lock from a callback: a context where it cannot know what locks are currently owned by the UI thread. Both of these code patterns should be avoided if at all possible. John, the problem you descibed regarding the refactoring infrastructure isn't caused by refactoring. It is the general way how ModalContext and progress reporting works. The event loop is run by the modal context infrastructure to ensure that clients can press the cancel button. Allmost all UI actions follow this pattern and therefore cause the same situation. Following your suggestions no runnable executed via IRunnableContext.run (for example via the ProgressMonitorDialog) could ever lock the workspace since the runnable context will spin the event loop resulting in any (a)sync runnable to be executed. However I agree that the behaviour is problematic in regards to deadlocks. We are fighting the problem of the "free event" loop and the possibility that any async or sync exec can run during a modal context thread since version 1.0 ;-) We discussed several solutions to this in the past. Here is what I can remember from the discussions: - have an extra event queue in core. Most of the time clients execute an async exec because they want to run after the currently executed workspace runnable. The only way how to achieve this currently is to use async exec. - like executing a workspace runnble executed a runnable via the Display should take a scheduling rule allowing the infrastucture the detect these case. This would mean that all async exec which for example aquire the workspace lock would wait until the modal context thread is finish. - disallow locking the workspace or any parts of it inside a display runnable. Created attachment 26174 [details]
JUnit test to reproduce the problem
I have created this JUnit test that illustrates the problem being discussed.
The test case in the original bug report was a bit too simple. In that case it
is easier to argue that the deadlock is expected.
In this JUnit test, you would not expect a deadlock from looking at the code.
The second workspace operation occurs inside an asyncExec, so you would expect
the first operation to complete before the second one starts. However, since
the first ModalContext spins the event loop (and it fact forces the event queue
to be flushed before returning), this test deadlocks every time.
I don't have an answer yet, but it feels like we should be able to handle this
particular case. I think the real root of the deadlock is the nested
ModalContext threads (when this test is deadlocked, you see two modal context
event loops in the UI thread's stack trace). If ModalContext could avoid the
nesting from happening it would break the deadlock.
I think threr are three somewhat relevant issues comes to mind here, first is related to spinning the event-loop is sometimes is needed for dispatching system events (e.g., mouse-clicks, screen-paint) which means that executing asyncExec Runnables is not the intent of the caller, yet, the Display object only offers readAndDispatch that will run the asyncs, which as we can see can lead to some "unexpected" beahviour on the part of the calling client. Now, executing asyncRunnables is a legitimate behaviour and should stay the default for the message pump in Eclipse, besides some clients do need this behaviour when they explicitly call readAndDispatch, however some clients don’t, so, may be Eclipse should offer a different flavour for event flushing, one that dispatches system events only -no async runnables- this will give clients the option to pick and choose which flavour to call depending on their intent. The second issue is that Eclipse does not specify a clear order-of-acquisition of its globally available locks, which is the only sure way to avoid deadlocks of any kind, currently, some clients acquire the UI -virtual- lock and then proceed to acquire Workspace lock, while others do exactly the opposite: acquire the Workspace lock then the UI lock, all of this works fine if these two clients are not executing in the same time, but once they do -as is the case of these two scenarios -the simplistic one or the more involved one- deadlock becomes inevitable. I suggest that some order of lock acquisition is needed here, meaning: you can't acquire lock B and then try to acquire lock A, you only can acquire A and then B, or acquire B alone without going back in the other direction and acquire A. Giving clients a clear-cut direction is important in deciding who is responsible for what whether it is Eclipse vs. clients or a client plugin vs. another client code. Third, it seems that Eclipse introduced the IThreadListener interface to handle lock ownership transfer, however, the problem I find is that this technique is "partially" supported, meaning, it is not supported in every thread switching context offered by Eclipse, for example, it is supported by the ModalContext object before it forks a thread whereas it is not supported by the Display object when it runs syncExecs although a thread switching is taking place here -between the user-thread that called syncExec and the UI thread. If lock transfer is supported in all places, then clients can change their code to take advantage of that and avoid both deadlock scenarios we have here. Re comment #5: > Almost all UI actions follow > this pattern and therefore cause the same situation. Following your suggestions > no runnable executed via IRunnableContext.run (for example via the > ProgressMonitorDialog) could ever lock the workspace since the runnable context > will spin the event loop resulting in any (a)sync runnable to be executed. This isn't quite right. In the "normal" case, the UI thread forks a ModalContext thread and only the ModalContext thread acquires the workspace lock. If the UI thread runs an asyncExec while it is waiting, and that async acquires the workspace lock, it will not cause a deadlock. The modal context operation will eventually complete and release the lock, and then the async will complete. The difference in the refactoring case is that when the modal context operation completes, it does not release the lock but instead transfers the lock back to the UI thread. Any other background thread waiting for this lock (such as another modal context), will not get a chance to run. I know this is all happen by design, but this coding pattern used by refactoring will cause deadlocks in cases where "normal" modal context operations do not. >> I know this is all happen by design, but this coding pattern used by
>> refactoring will cause deadlocks in cases where "normal" modal context
>> operations do not.
Agree that this can cause a deadlock in situations where the UI thread is
blocked by a (a)sync runnable waiting for a lock different than the workspace
lock. In all other cases there is no deadlock since when refactoring transfers
the rule back to the UI thread all they UI thread is doing is releasing the lock
right away.
I opend PR 107237 to investigate whether refactoring could release the lock in
the modal context thread instead of transferring it back to the UI thread.
Some comments to comment #7. First issue: that's exactly what we did in Visual Age Micro Edition. We filtered all non system event and executed them after the modal context thread has finished. The downside of this is that UI updating code thats reacts on model deltas in a non UI thread has to post the updating code into the UI thread. Since these are no system event they get filtered which leads to the fact that the UI updates after the modal context thread has finished which sometimes looks strange. Third issue: you can transfer scheduling rules when using Display.syncExec. You might want to have a look at UIPerformChangeOperation#executeChange (In reply to comment #10) > Third issue: you can transfer scheduling rules when using Display.syncExec. You > might want to have a look at UIPerformChangeOperation#executeChange The problem with the code in the UIPerformChangeOperation is that the lock have transferred before the runnable runs, meaning, while the calling thread is still blocked on the Disaplay.syncExec to run, it doesn't own the lock anymore which means someone else that is currently running in the UI thread will have a chance to acquire the lock that just got transferred to the UI and this could lead to all kinds of issues. So I think the IThreadListener is the way to go to transfer locks -just when running the runnable and not before- and then transfer it back to tha caller. Another scenario (not so uncommon) that cause deadlocks (from bug 108162): - aquire a lock in the UI thread - execute an operation (e.g. MoveFilesAndFoldersOperation) which spawns a modal context thread which aquires a lock as well *** Bug 108162 has been marked as a duplicate of this bug. *** Since the bug #108162 has been marked as duplicate of this bug (and I am not entirely convinced that it is) I will re-post my question here: Is there any news on this? I have not seen any Bugzilla activity on this bug (nor the 105491) in a while and was wondering what the plans are since it is really important for us. The best fix I can think of is to cause exceptions to be thrown when these deadlock prone patterns are used. This would ensure applications in the field don't deadlock, and force client code away from dangerous things like waiting on a thread while you own a scheduling rule. For example, we could throw an exception when: 1) A rule is transferred to a thread that is waiting for an unrelated rule. 2) A WorkspaceModifyOperation is forked from the UI while the UI owns a scheduling rule. OK, the last comment I can agree with - to a certain degree. It is also the reason why I said that I am not entirely convinced that 108162 is a duplicate of this. John already reopened the bug 108162 and I think it was the right thing to do. As for this bug (105491), the exception would be improvement. At the least the application would not be dead and the user will not have to restart. I think it is a valid approach and should be done. On the other hand, I do believe that there should be (much) more effort on the Eclipse side to enable existing APIs to work together and not get into the deadlock situations. By that I do not mean the ModalContext code / the thread mgmt code but rather its clients. We adopted Eclipse 3.1 and we are seeing a lot of deadlocks. Some of them were clearly caused by some practices in our code and we are trying to prune those. But there is also a number of them arising from using Eclipse APIs that do not indicate in any way what may or may not be used with them and still cause deadlock as in case of 108162. We can see that the RefactoringWizardDialog2 has an option to fork or not (on ModalContext) but that class is internal. The public class RefactoringWizardOpenOperation does not expose this flag but that is what is available to client code. So, either the public classes have to do more to give proper controls to their clients or they have to handle those themselves. The current situation is that lots of public APIs obtain those locks, don't expose or document any of that and do not handle the problems. Dusko, I don't understand your comments about the RefactoringWizardDialog2 and RefactoringWizardOpenOperation. None of these two classes provide API to control the fork mode. Are you referrring to the run method in RefactoringWizardDialog2, which has an fork argument ? This method is an implementation of IRunnableContext and necessary to use the RefactoringWizardDialog2 to execute runnable with progress (basically to show progress inside of the dialog). This implementation is basically the same as for the WizardDialog. However, I agree that the RefactoringWizardOpenOperation should document that it takes the workspace lock when executing a refactoring and that this lock is held in the right threads when the change are created and executed for both participants and processors. I added such a comment to the class. It is just an example. All I am saying is that RefactoringWizardDialog2 is an internal class being used by the framework and it has an option to fork the thread (in ModalContext). Yes, I know it is dictated by the interface but that is beyond the point - the client code never executes that directly and it is the framework that decides if the thread is going to be forked or not. Now, value for that parameter is important - it could either prevent or cause deadlock but the client code cannot control it. My point is that since the client cannot control how the ModalContext is used (actually, it does not even know that ModalContext is going to be used) the burden cannot be on the client code to deal with the deadlock. I can see only 3 appropriate approaches: 1) The API that does not expose any controls on what and how is going to be locked and does not document it should deal with deadlock on its own 2) The API could give more control to the client code 3) The API could document what kind of locks it is going to obtain and/or how it deals with transfers My point is that it cannot be left to the client code to hunt down hidden dependencies. I agree that we either have to give more control (where appropriate) or document this better. The refactoring framework forks the execution into the modal context thread to support cancelation of the refactoring execution. I am really reluctent to give clients control over this since not forking will raise a new set of problems. And more important: even if the refactoring would be executed with fork=false, executing the MoveFilesAndFoldersOperation in a participant would result in the same deadlock. I updated the Javadoc to explain how refactoring deals with locks and threads. Any update on this bug ? It has not been targetted. If this is not to be resolved, we need a good documentation on how clients should use this to avoid the deadlock. I don't see any more action to take on this bug. The actions taken were: - Some specific cases that will certainly cause deadlock were changed to throw exceptions instead so clients can better detect and resolve these situations during development (tracked by bug 108162) - Dirk updated the refactoring javadoc to describe its use of locks. |
The following simplified code deadlocks, one can argue that it is expected and should be handled by the code author, however, when these snippets of code are triggered in response to Eclipse events by the various plugins that know nothing about each other, this issue becomes difficult -if not impossible- to resolve especially when there is no definite rules of enagagement specified by Eclipse platform nor there is a standardized arbitration when it comes to acquiring Eclipse global locks (e.g., workspace, ui): //Add the following Action to a context-menu new org.eclipse.jface.action.Action() { public void run() { try { ResourcesPlugin.getWorkspace().run(new IWorkspaceRunnable() { public void run(IProgressMonitor monitor) throws CoreException { try { //Launch a progress dialog with fork set to true new ProgressMonitorDialog(getShell()).run(true, true, new IRunnableWithProgress() { public void run(IProgressMonitor monitor) throws InvocationTargetException, InterruptedException { try { //Simply run a NOOP workspace runnable //Deadlock: the UI thread already //owns the workspace lock and it is //blocked on the progress monitor to //return, which in turn is witing for //its runnable executing in the forked //thread to retrun, which in trun is //waiting to acquire the workspace lock //held by the UI thread. ResourcesPlugin.getWorkspace().run( new IWorkspaceRunnable() { public void run( IProgressMonitor monitor) throws CoreException { //do nothing } }, null); } catch (CoreException ex) { ex.printStackTrace(); } } }); } catch (Exception ex) { ex.printStackTrace(); } } }, null); } catch (Exception ex) { ex.printStackTrace(); } } };