| Summary: | Deadlock in WorkbenchErrorHandler.handle | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Aurelien Pupier <apupier> | ||||||
| Component: | UI | Assignee: | Andrey Loskutov <loskutov> | ||||||
| Status: | VERIFIED FIXED | QA Contact: | |||||||
| Severity: | normal | ||||||||
| Priority: | P3 | CC: | daniel_megert, Lars.Vogel, loskutov, manoj.palat, sxenos | ||||||
| Version: | 4.6 | Flags: | sxenos:
review+
Lars.Vogel: review+ |
||||||
| Target Milestone: | 4.6.2 | ||||||||
| Hardware: | All | ||||||||
| OS: | All | ||||||||
| See Also: |
https://git.eclipse.org/r/81517 https://git.eclipse.org/r/81663 https://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=4a17b8684b5f45f7fad8d36e904b17fff424011f https://git.eclipse.org/r/82388 https://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=fade9db40d6ed0cfb263fa64206cbc89e3772217 |
||||||||
| Whiteboard: | |||||||||
| Bug Depends on: | 241244 | ||||||||
| Bug Blocks: | |||||||||
| Attachments: |
|
||||||||
|
Description
Aurelien Pupier
From the stacktrace, the Worker 32 calls JMM.putInfos() which has locked and worker 1 and main are waiting at JMM.getInfo() to unlock. The last part of worker 32 has the following ST: java.lang.Object.wait(Native Method) org.eclipse.ui.internal.Semaphore.acquire(Semaphore.java:43) org.eclipse.ui.internal.UISynchronizer.syncExec(UISynchronizer.java:164) org.eclipse.swt.widgets.Display.syncExec(Display.java:4813) Moving to ui for comments. Adding Andrey, our "deadlock solving expert". Can we have a *full* thread dump please (as attachment)? I must confess, the unusual stack format confuses me. Beside this this seem to be a classical deadlock. main waits on Worker-32, Worker-32 waits on main. The root problem starts at main, where JavaModelManager.getInfo (guarded by a lock on JavaModelManager) is accessed from UI thread (implicit lock on UI is taken and the next one on JavaModelManager can't be acquired). So the main waits on a job. The misery continues with the Worker-32 job, which gets an exception in ContentType.describe() and that code logs an exception, which the WorkbenchErrorHandler decides to show via Display.syncExec => kaboom, deadlock is perfect. I wonder why WorkbenchErrorHandler is allowed to run syncExec from a non UI context (I see this was added via bug 241244 in commit 1904f477320dac4a9f4d45f7be478efba4a0b735). This is highly dangerous. I will check this. Hi Andrey, sorry, not possible to provide you the full thread stack. I extracted the interesting part of the stacks and forget to create the full thread stack after. It happened only one time, I don't know how to reproduce the issue. The concrete problem manifestation could be probably "fixed" by changing org.eclipse.core.internal.content.ContentType.describe(IContentDescriber, ILazySource, ContentDescription) implementation NOT to log *any* Error like below but throw "new IllegalStateException(e)" in the code below:
} catch (Error e) {
// describer got some serious problem. disable it (logging the reason) and throw the error again
invalidateDescriber(e);
throw e;
}
because org.eclipse.ui.internal.ide.IDEWorkbenchErrorHandler.handle(StatusAdapter, int) converts any innocent "log" to a *blocking* call to Display.syncExec on any OutOfMemoryError, StackOverflowError, VirtualMachineError or SWTError.
The root cause is however, the fix for bug 241244 (on WorkbenchErrorHandler) which has a great potential for deadlocks. IMHO the bug 241244 should be reopened and fix reverted, because it introduces much severe problems as it supposes to fix.
I think I have an idea. New Gerrit change created: https://git.eclipse.org/r/81517 There are some problems with the attached patch: 1. It creates extra threads, which is expensive. 2. It conditionally changes syncExecs into asyncExecs based on activity on other threads. If the caller was relying upon the blocking behavior as an invariant, that invariant is broken. I would like to suggest an alternative: Unconditionally replace the syncExec with asyncExec. This is simpler and more efficient than the proposed conditional approach, it's no more breaking than the current approach due to 2), above... and if it does cause breakage, those breakages will be much easier to reproduce and write tests for since they won't depend on the activity of other threads. In other words, I agree with your initial assessment that the patch for bug 241244 is the root cause and should be reverted. Re: comment 5 I think that calling invalidateDescriber is appropriate since the framework wants to disable buggy describers such that they don't generate further errors. However: IDEWorkbenchErrorHandler.handle is probably being too greedy in appending the blocking flag to error statuses. The purpose of adding the BLOCKING status is to ensure that the user will still see the error message if an error occurs which is so serious that it is right about to cause a crash-to-desktop. The error that occurred in this circumstance was almost certainly not that serious. We should figure out what sort of error it was and ensure that IDEWorkbenchErrorHandler.isFatal(...) returns false for it. Aurelien, do you still have the log from the run that generated this freeze? It almost certainly logged the problematic exception immediately before the freeze. Created attachment 264320 [details]
log the day after
Created attachment 264321 [details]
log few days before
I hit a spatiotemporal flaw... I have log files from the 14/09/16 and for the 20/09/16 but not for the 19/09/16 when the deadlock occurred. So I attached them but I'm not sure that ti will give a lot of information. New Gerrit change created: https://git.eclipse.org/r/81663 (In reply to Stefan Xenos from comment #8) > There are some problems with the attached patch: > > 1. It creates extra threads, which is expensive. > 2. It conditionally changes syncExecs into asyncExecs based on activity on > other threads. If the caller was relying upon the blocking behavior as an > invariant, that invariant is broken. The original intent was to introduce smallest possible change, but I agree with your assessment. > I would like to suggest an alternative: Unconditionally replace the syncExec > with asyncExec. > > This is simpler and more efficient than the proposed conditional approach, > it's no more breaking than the current approach due to 2), above... and if > it does cause breakage, those breakages will be much easier to reproduce and > write tests for since they won't depend on the activity of other threads. > > In other words, I agree with your initial assessment that the patch for bug > 241244 is the root cause and should be reverted. This should be done by https://git.eclipse.org/r/81663. I haven't reverted the entire patch, but only fixed the most dangerous part of it causing deadlocks. Gerrit change https://git.eclipse.org/r/81663 was merged to [master]. Commit: http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=4a17b8684b5f45f7fad8d36e904b17fff424011f New Gerrit change created: https://git.eclipse.org/r/82388 (In reply to Eclipse Genie from comment #16) > New Gerrit change created: https://git.eclipse.org/r/82388 @Stefan, do you mind to review this backport patch to 4.6.2? @Lars: I need +1 from component lead to backport. (In reply to Andrey Loskutov from comment #17) > @Stefan, do you mind to review this backport patch to 4.6.2? P.S: sorry for timing, 4.6.2 nightly build is broken so patch is not even compilable with hudson. > @Stefan, do you mind to review this backport patch to 4.6.2?
Yes. Do it. I added a +1 review flag in bugzilla. Do I need to do anything else?
(In reply to Stefan Xenos from comment #19) > > @Stefan, do you mind to review this backport patch to 4.6.2? > > Yes. Do it. I added a +1 review flag in bugzilla. Do I need to do anything > else? I think I need +1 *on the patch* from you and +1 from Lars (or other platform UI lead) on this bug. Gerrit change https://git.eclipse.org/r/82388 was merged to [R4_6_maintenance]. Commit: http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=fade9db40d6ed0cfb263fa64206cbc89e3772217 Verified on Win 7/Linux 64 with build id: M20161116-1100 by looking on the source code of WorkbenchErrorHandler and test results containing the new "testDeadlockFromBug501681" test, see http://download.eclipse.org/eclipse/downloads/drops4/M20161116-1100/testresults/html/org.eclipse.ui.tests_ep46M-unit-lin64_linux.gtk.x86_64_8.0.html. (In reply to Eclipse Genie from comment #21) > Gerrit change https://git.eclipse.org/r/82388 was merged to > [R4_6_maintenance]. > Commit: > http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=fade9db40d6ed0cfb263fa64206cbc89e3772217 > Increasing the minor segment was wrong. The version was already correct since its service segment got increased for 4.6.2. (In reply to Dani Megert from comment #23) > (In reply to Eclipse Genie from comment #21) > > Gerrit change https://git.eclipse.org/r/82388 was merged to > > [R4_6_maintenance]. > > Commit: > > http://git.eclipse.org/c/platform/eclipse.platform.ui.git/commit/?id=fade9db40d6ed0cfb263fa64206cbc89e3772217 > > > > Increasing the minor segment was wrong. The version was already correct > since its service segment got increased for 4.6.2. Tests are not considered API and hence their version follows the main bundle. See https://wiki.eclipse.org/Version_Numbering#Plug-ins_with_no_API for details. (In reply to Dani Megert from comment #24) > Tests are not considered API and hence their version follows the main > bundle. See https://wiki.eclipse.org/Version_Numbering#Plug-ins_with_no_API > for details. OK, is there anything we should do right now for 4.6.2 or for 4.6.3? (In reply to Andrey Loskutov from comment #25) > (In reply to Dani Megert from comment #24) > > Tests are not considered API and hence their version follows the main > > bundle. See https://wiki.eclipse.org/Version_Numbering#Plug-ins_with_no_API > > for details. > > OK, is there anything we should do right now for 4.6.2 or for 4.6.3? No. Unfortunately, once a build with the wrong number is out, we can't go back. The number in master is higher than in the R4_6_maintenance, so we're OK. Once 4.6.2 is out we should consider to align the version to the "real" bundle. This may have caused a regression: bug 508696 Note that I haven't confirmed that it is actually caused by this patch and the (possible) regression is less severe than the deadlock reported here, so I'm *not* suggesting we revert this fix even if it caused the regression. |