| Summary: | [Widgets] Deadlock while UI thread displaying/computing a change proposal and non-UI thread creating image | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Stephan Herrmann <stephan.herrmann> | ||||||
| Component: | SWT | Assignee: | Platform-SWT-Inbox <platform-swt-inbox> | ||||||
| Status: | CLOSED WONTFIX | QA Contact: | |||||||
| Severity: | major | ||||||||
| Priority: | P3 | CC: | carsten.pfeiffer, daniel_megert, ericwill, irbull, markus.kell.r, pinnamur, Silenio_Quarti, steffen.pingel | ||||||
| Version: | 3.6 | Keywords: | triaged | ||||||
| Target Milestone: | --- | ||||||||
| Hardware: | PC | ||||||||
| OS: | Linux-GTK | ||||||||
| Whiteboard: | stalebug | ||||||||
| Bug Depends on: | |||||||||
| Bug Blocks: | 312189 | ||||||||
| Attachments: |
|
||||||||
|
Description
Stephan Herrmann
For now simply use 'null' as the image. This looks like an SWT issue where we wait on a UI lock in the non-UI thread. I think creating an image in the non-UI thread should be OK and not block. SWT team: maybe the lock is hold because we call Shell.setEnabled(false), but again, it should still be able to create a new image in a non-UI thread. > SWT team: maybe the lock is hold because we call Shell.setEnabled(false), but > again, it should still be able to create a new image in a non-UI thread. That was the other change for bug 195834, but on closer look, I don't think that's the issue here. The problem is really that the main loop (OS.g_main_context_iteration(..)) takes the global lock, and the constructor of Image assumes the lock will become available eventually. But the main loop is now blocked since we had to make ChangeCorrectionProposal.getChange() synchronized. This is actually quite a severe problem. We only didn't see this earlier because the images we use in JDT quick fix popups are pre-initialized in the image registry. The general pattern is: - main thread holds org.eclipse.swt.internal.Platform.lock - non-UI thread holds a monitor - main thread tries to acquire the same monitor (but can't, so it waits) - non-UI thread calls a constructor of Image, which tries to acquire Platform.lock (but can't, so it waits) => deadlock This problem can only be prevented at the client side if we stipulate that new Image(..) may only be called in a non-UI thread if the thread does not hold any monitors that could also be requested in the UI thread. But that would be quite hard to guarantee in existing code. In the end, it would not allow the use of Java object monitors in any case where anywhere in the execution stack, an image is created in a non-UI thread. I do not think there is a fix for this in the SWT side. I believe the only option here is for client code to create the image in the ui-thread or to avoid doing synchronization between the UI thread and other threads. *** Bug 309182 has been marked as a duplicate of this bug. *** *** Bug 302079 has been marked as a duplicate of this bug. *** I thought this wouldn't happen in the SDK, but bug 309182 proved that we need to fix this for 3.6. The trace in that dump shows an example where we compute images on demand. Pre-caching all the images would be possible for that case, but I doubt it's the last one. Bug 276438 also looks like the same problem. The last call on this issue was WONTFIX in bug 236149. If that's going to stay the answer, then this must be documented in Image and needs an entry in the Readme. Even then, clients won't be able to fix this everywhere, since that breaking change ripples through a huge number of APIs (basically every method is affected that could create an Image and take a lock *anywhere in the call stack*). So at least the constructors of Image must be protected such that they detect image creations in a non-UI thread and throw an illegal thread access exception instead of deadlocking. We cannot start throwing an exception in the image constructors. This is a breaking API change. Applications that do not perfom synchronization between the UI thread and the thread creating the image are still valid (and work). We could add warnings in the java doc saying that causing the UI thread to wait in a look may cause deadlocks. Created attachment 166018 [details] Workaround for JDT/UI I proposed to detect the problematic cases and only throw an exception in those cases, but I guess that's not possible: Thread.holdsLock(..) could be used to detect whether the UI thread holds the Platform.lock, but there's no way to detect (e.g. in the constructor of Image) whether a given thread holds any lock that the UI thread is waiting for. We can't ship 3.6 with a potential deadlock, so I've released this workaround to HEAD. It opens bug 195834 again for GTK, but that's much less severe than risking a deadlock. I still think this should be fixed in SWT. We can't detect the problem in SWT, and it's almost impossible to ensure that client code never holds a lock and waits for it in the UI thread. So the only solution I see is to avoid using 'synchronized' to implement the Platform.lock on GTK. I don't have a silver bullet, but a solution could be: - detect that an Image is created in a non-UI thread - in that, case try to take the lock, and if you don't get it in the next 5 seconds, assume that something's wrong and throw an exception. That's a heuristic, but it's still better than risking a deadlock. (In reply to comment #9) Thanks Markus, If you tag this for tonight's IBuild (8:00pm EST) then I'll hammer on content assist for GTK during the test pass tomorrow. (In reply to comment #10) > If you tag this for tonight's IBuild (8:00pm EST) then I'll hammer on content > assist for GTK during the test pass tomorrow. It's released for I20100425-2000, but there's not much to test via "hammering" ;-). The deadlock situation is well understood. In JDT proposals, it happens when an image for a Java element is created in a background thread. Those composite images are cached when they are first created, so this most often happens shortly after starting Eclipse. To provoke the problem you need exact timing or breakpoints in the right places, and you need to use an example where the image is used for the first time in the non-UI thread. I'll test this during the M7 test pass. The downside of my fix is that it reverted bug 195834 on GTK, so if you try hard (apply the quick fix exactly at the time when the additional info hover with the preview is computed), you can see the BadLocationExceptions from bug 195834 again. Anyway, thanks for adding this comment! It made me look at my fix again, and I discovered that I forgot to remove the 'synchronized' modifier from the getChange() method. I've fixed this for I20100425-2000. Verified in I20100425-2000 that the workaround in ChangeCorrectionProposal
#getChange() "works" (i.e. just throws an exception with the steps below).
Steps to reproduce the deadlock without the workaround:
- set method entry breakpoint in ChangeCorrectionProposal#getChange()
- disable Prefs > Debug > Activate workbench when breakpoint is hit
- launch clean runtime workbench
- paste this snippet:
public class C {
public static void main(String[] args) {
new Cloneable() {};
}
}
- set caret before "new" (make sure you don't hover over "Cloneable", since that would destroy the effect)
- press Ctrl+1
- wait at least 1 second
- press Enter
- in host workbench, resume all suspended threads
=> deadlock
Created attachment 166061 [details] Patch using SWT internals to release lock There's another solution: If we don't want to guess when the problem could happen, then we could put this burden on the client, but provide a way for the client to resolve the problem. The idea is to protect code that runs in the UI thread and that needs to synchronize, such that the GTK lock is released at the time the UI thread enters the synchronized block. This pattern is already used in Display#sleep(). AFAICS, that's the reason why "new Image(..)" can work at all in a non-UI thread. In the attached patch, I've copied that code, and it worked nice (does not deadlock, and properly synchronizes such that bug 195834 doesn't happen). But I can't refer to internals. If you could offer this as API, e.g. Display#runWithoutOSLock(Runnable), then client code that knows it can run into this problem can use this API and make sure it only synchronizes in the Runnable, but not outside it. (In reply to comment #11) > The downside of my fix is that it reverted bug 195834 on GTK, so if you try > hard (apply the quick fix exactly at the time when the additional info hover > with the preview is computed), you can see the BadLocationExceptions from bug > 195834 again. FWIW, yes, in M7 bug 195834 is alive. Even without trying very hard I just saw it happen again during normal operation. (In reply to comment #14) > FWIW, yes, in M7 bug 195834 is alive. Even without trying very hard I just > saw it happen again during normal operation. According to comment 13 this should not happen. Please file a new bug report against JDT UI. (In reply to comment #15) > (In reply to comment #14) > > FWIW, yes, in M7 bug 195834 is alive. Even without trying very hard I just > > saw it happen again during normal operation. > > According to comment 13 this should not happen. AFAICS the patch from comment 13 hasn't been committed yet. So I was more like encouraging Markus to keep going with that :) (In reply to comment #13) > [..] But I can't refer to internals. > [..] > If you could offer this as API, e.g. Display#runWithoutOSLock(Runnable), then > client code that knows it can run into this problem can use this API and make > sure it only synchronizes in the Runnable, but not outside it. I can't release this patch, since it refers to internals from SWT. If the SWT team agrees with the approach, I can implement the proposed API. I'm confident that the PMC will approve the API addition, since it's the only solution we have that prevents deadlocks (a blocker) and bug 195834 (a major inconvenience with many dups). I agree with the previous comment. However, we should either have a new bug for bug 195834 or reopen it, since it is indeed not fixed on Linux. (In reply to comment #17) > > If you could offer this as API, e.g. Display#runWithoutOSLock(Runnable), > > sure it only synchronizes in the Runnable, but not outside it. > I can't release this patch, since it refers to internals from SWT. If the SWT > team agrees with the approach, I can implement the proposed API. I'm confident > that the PMC will approve the API addition, since it's the only solution we > have that prevents deadlocks (a blocker) and bug 195834 (a major inconvenience > with many dups). Is this what you have in mind? void runWithoutOSLock (Runnable runnable) { Lock lock = Platform.lock; int count = lock.lock (); for (int i = 0; i < count; i++) lock.unlock (); try { runnable.run(); } finally { for (int i = 0; i < count; i++) lock.lock (); lock.unlock (); } } I do not think the API make sense. The OS lock is needed to avoid multiple threads in OS libraries which are not thread-safe. If we give such API, we are basically opening the lock up without any control. I believe this would defeat its purpose and we would start getting random crashes. Display.sleep() is the only place where the UI thread can safely leave the lock and let other threads call OS libraries because we know there is nothing else happening in the UI thread. Yes, that's about what I had in mind, with the important difference that the first line should be:
checkDevice ();
I agree that all hell breaks loose when clients can do that from any thread, but is it also a problem when they are in the UI thread? It's similar to what happens when a client decides to call Display#sleep(), but with the difference that no other events are being processed when #runWithoutOSLock(Runnable) is called.
The usage of the Platform.lock ensures that the UI thread only proceeds when it got the lock back, but while the Runnable is executed, the lock is free for another process to take (and release). The constructor of Image needs the lock, and if it's called while the UI thread is inside the runnable of Display#runWithoutOSLock(Runnable), then it can do its business without being locked.
I've filed bug 312189 for the BadLocationException from bug 195834 that can now happen again on GTK. *** Bug 313739 has been marked as a duplicate of this bug. *** Reducing severity since there is a workaround in JDT. I must admit I lost track what part is fixed and what remains. Is it safe now to create images during ChangeCorrectionProposal.getChange()? (In reply to comment #24) > Is it safe now to create images during ChangeCorrectionProposal.getChange()? Yes. For 3.6, we removed the deadlock but reopened the door for BadLocationExceptions on GTK. Since 3.6.2, the workaround from bug 312189 also fixes that for the ChangeCorrectionProposal. However, the general problem outlined in comment 3 is still present (that's the subject of this bug). (In reply to comment #25) Thanks for clarifying! Looks like there is a patch pending for this but the bug is not resolved. Moving to triaged. This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're closing this bug. If you have further information on the current state of the bug, please add it and reopen this bug. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. -- The automated Eclipse Genie. |