A deadlock has been reported on some machines (see bugzilla 157089). The UI thread stack is the following when the deadlock occurs:

Collections$SynchronizedSet(Collections$SynchronizedCollection).add(Object) line: not available
FileProxyNodeCache.addResourceListener(IResourceChangeListener) line: 284
DefaultTypeProvider.<init>() line: 48
TypeProviderManager$DefaultTypeProviderInfo.<init>(TypeProviderManager, String) line: 111
TypeProviderManager.getTypeProviderProxyNode(IProject, String) line: 183
LogicalTestNavigatorProvider.getTypeProviders(IProject) line: 85
LogicalTestNavigatorProvider.getChildren(Object) line: 68
TreeViewer(AbstractTreeViewer).getRawChildren(Object) line: 1166
TreeViewer.getRawChildren(Object) line: 768
TreeViewer(AbstractTreeViewer).getFilteredChildren(Object) line: 574
TreeViewer(AbstractTreeViewer).getSortedChildren(Object) line: 543
...
Created attachment 53908 [details]
Fix

Here's the patch for the deadlock issue.
I assume this is for Julien since he already has a fix for it.
Although the patch resolves the initial issue that I reported, I found another case where the Test perspective hangs the workbench. When I check out test-results/platform/org.eclipse.hyades.use.cases from TPTP's CVS and switch to the Test perspective, the workbench hangs with my CPU maxed out. An out-of-memory exception is eventually thrown. I'm not sure whether it matters, but the 'logical' toolbar item of the Test Navigator was enabled before I checked out the plug-in.
I verified that this patch works in my scenario. I no longer get a deadlock.
(In reply to comment #3) Ali, please provide the stack trace.
(In reply to comment #3)
> When I check-out test-results/platform/org.eclipse.hyades.use.cases from TPTP's
> cvs and switch to the Test perspective, the workbench hangs with my CPU maxed
> out. An out of memory exception is eventually thrown.

There are two other bugzillas that may track this problem:

160224 TestNavigator: proxies should not use the common resource...
160221 [Performance] TestNavigator: Large execution results display is excessively slow

These two fixes (which were deferred to 4.4) address some performance & memory issues.
Since this problem only occurs on specific platforms (e.g. it was not exposed during i3 TP1) and cannot be consistently reproduced, decreasing the severity to major to match the other related defects (160224, 160221 and 157089). Furthermore, since the fix for this defect does not cover all problem scenarios and may introduce negative side effects due to the nature of the fix, deferring to 4.4. All four defects (this one, 160224, 160221 and 157089) will be fixed at the beginning of the release to ensure thorough testing during the release's test passes.
(In reply to comment #5)
> Ali, please provide the stack trace.

A stack trace won't be of any use for the scenario that I mentioned in comment #3. The hang is not being caused by a deadlock. There is activity, because my CPU is being maxed out. I get a different stack trace every time the main thread is paused. I suspect that there is an infinite loop that we never step out of.

(In reply to comment #7)
Actually, the problem is consistently reproducible on machines where the deadlock happens, and it is not platform-specific. As discussed with Paul, I don't think this should have been deferred to 4.4. Hanging the workbench can result in the loss of unsaved resources, preferences, etc.
(In reply to comment #8)
> I get a different stack trace when the main
> thread is paused. I suspect that there is an infinite loop that we never step
> out of.

I still think this is a different issue, related to performance rather than to an infinite loop. At the end of 4.3i2, I spent some time profiling the test navigator to identify performance issues, and I worked with the project org.eclipse.hyades.logging.core.tests (which contains very large test logs). The initial discovery of the project's content, on a decent machine, takes about 80 minutes. The main issue was in ICU code, in which 80% of the time is spent. Decorators are also a major slowdown. After a few optimizations (mainly by adopting a "lazy" strategy), the time falls to 4 minutes. This issue is tracked in https://bugs.eclipse.org/bugs/show_bug.cgi?id=160221.
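For context, the "lazy" strategy mentioned above is not spelled out in this bug; the sketch below is purely illustrative (the Lazy class and its names are not from the TPTP code). The idea is to defer an expensive computation, such as label or decorator formatting, until the value is first requested, and to compute it at most once:

```java
import java.util.function.Supplier;

// Illustrative sketch of a "lazy" memoizing wrapper: the expensive
// computation runs only on first access, then the result is cached.
public class Lazy<T> {
    private Supplier<T> supplier;
    private T value;

    public Lazy(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    public synchronized T get() {
        if (supplier != null) {   // not computed yet
            value = supplier.get();
            supplier = null;      // mark as computed; free the supplier for GC
        }
        return value;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        Lazy<String> label = new Lazy<>(() -> { calls[0]++; return "formatted"; });
        label.get();
        label.get();
        System.out.println("calls=" + calls[0] + " value=" + label.get());
    }
}
```

With this pattern, tree items that are never scrolled into view never pay the formatting cost, which is consistent with the 80-minute-to-4-minute improvement described above.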
(In reply to comment #8) I was looking for the stack trace of the memory exception.
(In reply to comment #7)
Based on Ali's prompting, Julien and I reviewed this defect again and we concur with Ali. Although we could not reproduce/verify Bob's scenario (and he is equally busy with unrelated 4.3 deliverables, so he could not test Julien's fix), Julien's fix does resolve both Sheldon's and Ali's issues. After a thorough review of Julien's fix, we are proposing that this defect be committed in 4.3. As such, retargeting to 4.3 and raising the severity to critical, since in some cases data loss is experienced.
1) Explain why you believe this is a stop-ship defect. How does the defect manifest itself, and how will users of TPTP / consuming products be affected if the defect is not fixed?

The defect manifests itself as a deadlock in which the UI thread is involved. As a result, Eclipse stops responding and the user has to kill the application. The defect appears randomly when the user switches to the logical view or opens the test launch configuration. The probability increases on multi-core machines.

2) Is there a work-around? If so, why do you believe the work-around is insufficient?

There is no work-around.

3) Is this a regression or API breakage? Explain.

This is a regression since 4.2, which did not show this problem.

4) Does this require new API?

No.

5) Who performed the code review?

Paul Slauenwhite.

6) Is there a test case attached to the bugzilla record?

No, but regressions that the fix might cause are covered by the regular Test Navigator tests (Test.UI.TestNavigator_Logical.testsuite).

7) What is the risk associated with this fix?

The fix is very local, and consists of narrowing the scope of a synchronized {} block. The synchronized {} block was covering code that did not actually need to be synchronized, which resulted in the deadlock.

Before the fix:

    synchronized (listeners) {
        // This block makes sure that no listener gets registered/deregistered
        // while iterating through the list.
        // for each listener: notify it
    }

After the fix:

    synchronized (listeners) {
        copy = make_copy(listeners);
    }
    // for each item of copy, notify it

8) Is this fix related to any standards that TPTP adheres to? If so, who has validated that the fix continues to adhere to the standard?

No.
As requested by the PMC as part of the approval process, enclosed is the background information and cause of the deadlock.

Involved threads:
- T1: UI thread
- T2: Resource Change notification thread

Involved critical resources:
- FileProxyNodeCache.resourceListeners (a synchronized set)
- Display.syncExec()'s implicit lock

Deadlock scenario:

T2: A resource changed event is issued in T2. Method FileProxyNodeCache.resourceChanged() is invoked. T2 acquires a lock on the shared collection "resourceListeners":

    public void resourceChanged(IResourceChangeEvent event) {
        IResourceDelta delta = event.getDelta();
        if (delta == null) return;
        //- first delegates to other listeners registered in the cache.
  =>    synchronized (resourceListeners) {
            for (Iterator it = resourceListeners.iterator(); it.hasNext();) {
                IResourceChangeListener listener = (IResourceChangeListener) it.next();
                try {
                    listener.resourceChanged(event);
                } catch (Throwable e) {
                    UiPlugin.logError(e);
                }
            }
        }
        // ...
    }

T1: The user switches to the logical view. One of the logical providers invokes FileProxyNodeCache.addResourceListener(this):

    public void addResourceListener(IResourceChangeListener listener) {
  =>    resourceListeners.add(listener);
    }

The statement above is blocked because T2 holds a lock on "resourceListeners".

T2: Each listener in the set "resourceListeners" is notified of the event. One of these listeners happens to be an instance of ResourceChangeUpdaterProvider.UIUpdaterProvider. The method ResourceChangeUpdaterProvider.UIUpdaterProvider.doProcessDelta() gets invoked:

    protected void doProcessDelta(IResourceDelta _delta) {
        this.delta = _delta;
        if (asynOperation)
            org.eclipse.swt.widgets.Display.getDefault().asyncExec(this);
        else
  =>        org.eclipse.swt.widgets.Display.getDefault().syncExec(this);
    }

In this case asynOperation == false, so Display.getDefault().syncExec(this) is invoked. As a result, T2 is blocked until T1 executes the runnable. But T1 is waiting for T2 to release the locked synchronized set: deadlock.
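For reference, the lock inversion described above can be reproduced outside of Eclipse with plain threads. The sketch below is a minimal simulation, not the TPTP code: syncExec() is faked with a queue and a latch, and all names are illustrative. T2 holds the listeners lock while waiting for the fake UI thread, which is itself blocked trying to take that lock:

```java
import java.util.concurrent.*;

// Minimal sketch of the T1/T2 lock inversion. "syncExec" is simulated by
// posting a runnable to a queue and waiting for the fake UI thread to run it.
public class DeadlockSketch {
    static final Object listeners = new Object(); // stands in for resourceListeners
    static final BlockingQueue<Runnable> uiQueue = new LinkedBlockingQueue<>();

    // Stand-in for Display.syncExec(): enqueue and block until executed.
    static void syncExec(Runnable r) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        uiQueue.put(() -> { r.run(); done.countDown(); });
        done.await();
    }

    // Returns true if both threads end up permanently blocked (deadlock).
    static boolean demo() throws InterruptedException {
        // T2: notification thread holds the listeners lock, then calls
        // syncExec, as UIUpdaterProvider.doProcessDelta() does.
        Thread t2 = new Thread(() -> {
            synchronized (listeners) {
                try { syncExec(() -> {}); } catch (InterruptedException ignored) {}
            }
        });
        t2.setDaemon(true);
        t2.start();
        Thread.sleep(100); // let T2 take the lock first

        // T1: fake UI thread first tries to register a listener (needs the
        // lock), and only afterwards would it drain its event queue.
        Thread t1 = new Thread(() -> {
            synchronized (listeners) { /* addResourceListener(...) */ }
            while (true) {
                try { uiQueue.take().run(); } catch (InterruptedException e) { return; }
            }
        });
        t1.setDaemon(true);
        t1.start();

        t1.join(500);
        t2.join(500);
        return t1.isAlive() && t2.isAlive(); // neither thread can make progress
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked=" + demo());
    }
}
```

Running this prints deadlocked=true: T1 never reaches its event loop, so T2's syncExec never completes, matching the stack trace at the top of this report.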
Resolution: The synchronized set was locked while iterating through it, to prevent another thread from modifying the set during the iteration. But holding a lock on this set is not necessary while each listener processes the event. So a local copy of the listeners set is created (inside a synchronized block), and then the listeners are notified outside the synchronized block:

    public void resourceChanged(IResourceChangeEvent event) {
        IResourceDelta delta = event.getDelta();
        if (delta == null) return;
        //- first delegates to other listeners registered in the cache.
        IResourceChangeListener[] copy;
        synchronized (resourceListeners) {
            copy = (IResourceChangeListener[]) resourceListeners.toArray(
                new IResourceChangeListener[resourceListeners.size()]);
        }
        for (int i = 0; i < copy.length; i++) {
            try {
                copy[i].resourceChanged(event);
            } catch (Throwable e) {
                UiPlugin.logError(e);
            }
        }
    }

Note: It is not possible to replace syncExec() with asyncExec() in this case.
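The same copy-then-notify pattern can be shown as a small self-contained sketch (the class and listener names below are illustrative, not from the TPTP code). The lock is held only long enough to take a snapshot of the listener set, so no listener callback can ever run while the lock is held, and no lock cycle is possible:

```java
import java.util.*;

// Illustrative sketch of the fix's pattern: snapshot the listener set under
// the lock, then notify the snapshot outside the lock.
public class SnapshotNotify {
    interface Listener { void changed(String event); }

    private final Set<Listener> listeners =
            Collections.synchronizedSet(new HashSet<>());

    void addListener(Listener l) {
        listeners.add(l);
    }

    void fireChanged(String event) {
        Listener[] copy;
        synchronized (listeners) {            // lock held only for the copy
            copy = listeners.toArray(new Listener[0]);
        }
        for (Listener l : copy) {             // callbacks run lock-free
            try {
                l.changed(event);
            } catch (Throwable t) {
                System.err.println("listener failed: " + t);
            }
        }
    }

    public static void main(String[] args) {
        SnapshotNotify s = new SnapshotNotify();
        // A callback may even register another listener without deadlocking
        // or throwing ConcurrentModificationException.
        s.addListener(e -> s.addListener(ev -> {}));
        s.addListener(e -> System.out.println("got " + e));
        s.fireChanged("resource-changed");
    }
}
```

A design alternative with the same semantics would be java.util.concurrent.CopyOnWriteArrayList, whose iterators work on a snapshot automatically; the explicit copy above keeps the sketch close to the patch as committed.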
PMC request for 4.3: http://dev.eclipse.org/mhonarc/lists/tptp-pmc/msg02595.html http://dev.eclipse.org/mhonarc/lists/tptp-pmc/msg02594.html
Patch checked in to CVS (HEAD). Requesting that Hubert patch the TPTP-4.3.0-200611160100 candidate driver.
Will be packaged with the fix from https://www.eclipse.org/bugs/show_bug.cgi?id=164988.
File affected: /cvsroot/tptp/test/org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/internal/navigator/proxy/FileProxyNodeCache.java
(In reply to comment #11)
> ... Although we could not reproduce/verify Bob's scenario (and he is
> equally busy with unrelated 4.3 deliverables so he could not test Julien's
> fix), Julien's fix does resolve both Sheldon and Ali's issue ...

My scenario was identical to Sheldon's. I just verified my scenario against the patched driver TPTP-4.3.0-200611160100A, and it is fixed by it.
Bug 165682 has been created as a follow-up, to address the excessive locking performed by the other piece of code involved in the deadlock.
Reporter: Please verify and close in preparation for shutting down the TPTP 4.4 release. Thanks.
Closing by default since not verified/closed by the originator in the 6+ months since being resolved. Please reopen if the issue is still present in the latest TPTP release or the resolution is not correct.