Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 164637 - Deadlock in Test Navigator
Summary: Deadlock in Test Navigator
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: TPTP (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows XP
: P1 critical (vote)
Target Milestone: ---   Edit
Assignee: Julien Canches CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 157089
  Show dependency tree
 
Reported: 2006-11-15 08:50 EST by Julien Canches CLA
Modified: 2016-05-05 10:28 EDT (History)
6 users (show)

See Also:


Attachments
Fix (1.79 KB, patch)
2006-11-15 08:54 EST, Julien Canches CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Julien Canches CLA 2006-11-15 08:50:22 EST
A deadlock has been reported on some machines (see bugzilla 157089). The UI thread stack is the following when the deadlock occurs:
Collections$SynchronizedSet(Collections$SynchronizedCollection).add(Object)
line: not available 
FileProxyNodeCache.addResourceListener(IResourceChangeListener) line: 284       
DefaultTypeProvider.<init>() line: 48   
TypeProviderManager$DefaultTypeProviderInfo.<init>(TypeProviderManager, String)
line: 111       
TypeProviderManager.getTypeProviderProxyNode(IProject, String) line: 183        
LogicalTestNavigatorProvider.getTypeProviders(IProject) line: 85        
LogicalTestNavigatorProvider.getChildren(Object) line: 68       
TreeViewer(AbstractTreeViewer).getRawChildren(Object) line: 1166        
TreeViewer.getRawChildren(Object) line: 768     
TreeViewer(AbstractTreeViewer).getFilteredChildren(Object) line: 574    
TreeViewer(AbstractTreeViewer).getSortedChildren(Object) line: 543      
.
.
Comment 1 Julien Canches CLA 2006-11-15 08:54:19 EST
Created attachment 53908 [details]
Fix

Here's the patch to the deadlock issue.
Comment 2 Mark D Dunn CLA 2006-11-15 09:05:47 EST
I assume this is for Julien since he already has a fix for it.  
Comment 3 amehrega CLA 2006-11-15 10:51:52 EST
Although the patch resolves the initial issue that I reported, I found another case where the test perspective hangs the workbench.

When I check-out test-results/platform/org.eclipse.hyades.use.cases from TPTP's cvs and switch to the Test perspective, the workbench hangs with my CPU maxed out.  An out of memory exception is eventually thrown.

I'm not sure if it matters or not but the 'logical' toolbar item of the test navigator was enabled before I checked out the plug-in.
Comment 4 Sheldon Lee-Loy CLA 2006-11-15 10:56:49 EST
I verified that this patch works in my scenario.  I no longer get a dead lock.
Comment 5 Paul Slauenwhite CLA 2006-11-15 11:07:39 EST
(In reply to comment #3)

Ali, please provide the stack trace.
Comment 6 Julien Canches CLA 2006-11-15 11:53:36 EST
(In reply to comment #3)
> When I check-out test-results/platform/org.eclipse.hyades.use.cases from TPTP's
> cvs and switch to the Test perspective, the workbench hangs with my CPU maxed
> out.  An out of memory exception is eventually thrown.

There are two other bugzillas that may track this problem:
160224 TestNavigator: proxies should not use the common resource...
160221 [Performance] TestNavigator: Large execution results display is excessively slow
These two fixes (which were deferred to 4.4) address some performance & memory issues.
Comment 7 Paul Slauenwhite CLA 2006-11-16 08:48:14 EST
Since this problem only occurs on specific platforms (e.g. not exposed during i3 TP1) and cannot be consistently reproduced, decrease the severity to major to match the other related defects (160224, 160221 and 157089).  Furthermore, since the fix for this defect does not cover all problem scenarios and may introduce negative side effects due to the nature of the fix, deferring to 4.4.  All four defects (160224, 160221 and 157089) will be fixed at the beginning of the release to ensure thorough testing during the release's test passes.
Comment 8 amehrega CLA 2006-11-16 10:40:15 EST
(In reply to comment #5)
> Ali, please provide the stack trace.

A stack trace won't be of any use for the scenario that I've mentioned in comment#3.  The hang is not being cause by a deadlock.  There is activity because my CPU is being maxed out.  I get a different stack trace when the main thread is paused.  I suspect that there is an infinite loop that we never step out of.

(In reply to comment #7)
Actually the problem is consistently reproducible on machines where the deadlock happens and it is not platform specific.  As discussed with Paul, I don't think this should have been deferred to 4.4.  Hanging the workbench can result in the lost of unsaved resources/preferences/etc...
Comment 9 Julien Canches CLA 2006-11-16 11:12:50 EST
(In reply to comment #8)
I get a different stack trace when the main
> thread is paused.  I suspect that there is an infinite loop that we never step
> out of.
I still think this is a different issue related to performance, not to an infinite loop. At the end of 4.3i2, I spent some time profiling the test navigator to identify performance issues, and I worked with the project org.eclipse.hyades.logging.core.tests (which contains very large test logs). The initial discovery of the project's content, on a decent machine, takes about 80 minutes. The main issue was in ICU code, in which 80% of time is spent. Decorators are also a major slowdown. After a few optimizations (mainly by adopting a "lazy" strategy), the time falls to 4 minutes. This issue is tracked in https://bugs.eclipse.org/bugs/show_bug.cgi?id=160221.
Comment 10 Paul Slauenwhite CLA 2006-11-16 12:24:48 EST
(In reply to comment #8)

I was looking for the stack trace of the memory exception.
Comment 11 Paul Slauenwhite CLA 2006-11-16 12:30:04 EST
(In reply to comment #7)

Based on Ali's prompting, Julien and I reviewed this defect again and we concur with Ali.  Although we could not reproduce/verify Bob's scenario (and he is equally busy with unrelated 4.3 deliverables so he could not test Julien's fix), Julien's fix does resolve both Sheldon and Ali's issue.  After a thorough review of Julien fix, we are proposing that this defect be committed in 4.3.  As such, retargetting and raising the severity to critical since is some cases, data loss is experienced.
Comment 12 Paul Slauenwhite CLA 2006-11-16 12:31:00 EST
1) Explain why you believe this is a stop-ship defect.  How does the defect manifest itself, and how will users of TPTP / consuming products be affected if the defect is not fixed?

The defect manifests itself as a deadlock, in which the UI thread is involved. As a result, eclipse stops responding and the user has to kill the application. The defect appears randomly when the user switches to the logical view or opens the test launch configuration. The probability increases on multi-core machines.

2) Is there a work-around?  If so, why do you believe the work-around is insufficient?

There is no work-around.

3) Is this a regression or API breakage?  Explain.

This is a regression since 4.2, which did not show this problem.

4) Does this require new API?

No.

5) Who performed the code review?

Paul Slauenwhite.

6) Is there a test case attached to the bugzilla record?

No, but regressions that the fix might cause are covered by regular Test Navigator tests (Test.UI.TestNavigator_Logical.testsuite).

7) What is the risk associated with this fix?

The fix is very local, and consists in narrowing the scope of a synchronized {} block. The synchronized{} was covering code that actually did not need to be synchronized, and resulted in the deadlock.
Before the fix:
synchronized(listeners) {
  // This block makes sure that no listener gets registered/deregistered while iterating through the list
  // for each listener: notify each listener
}

After the fix:
synchronized(listeners) {
  copy = make_copy(listeners);
}
// for each item of copy, notify it

8) Is this fix related to any standards that TPTP adheres to?  If so, who has validated that the fix continues to adhere to the standard?

No.
Comment 13 Paul Slauenwhite CLA 2006-11-17 08:44:26 EST
As requested by the PMC as part of the approval process, enclosed is the background information and cause of the deadlock:

Involved threads: 
- T1: UI thread 
- T2: Resource Change notification thread 

Involved critical resources: 
- FileProxyNodeCache.resourceListeners (a synchronized set) 
- Display.syncExec()'s implicit lock 

Deadlock scenario: 

T2: 
A resource changed event is issued in T2. Method FileProxyNodeCache.resourceChanged() is invoked. 
T2 acquires a lock on the shared collection "resourceListeners": 
        public void resourceChanged(IResourceChangeEvent event) { 
                IResourceDelta delta = event.getDelta(); 
            if(delta == null) return; 
            //- first delegates to other listeners registered in the cache. 
=>          synchronized(resourceListeners) { 
                    for (Iterator it = resourceListeners.iterator(); it.hasNext();) { 
                        IResourceChangeListener listener = (IResourceChangeListener) it.next(); 
                        try { 
                            listener.resourceChanged(event); 
                        } catch (Throwable e) { 
                            UiPlugin.logError(e); 
                        }                 
                    } 
            } 
            // ... 
        } 

T1: 
The user switches to the logical view. 
One of the logical providers invokes FileProxyNodeCache.addResourceListener(this): 
public void addResourceListener(IResourceChangeListener listener) { 
=>    resourceListeners.add(listener); 
} 
The statement above is blocked because T2 holds a lock on "resourceListeners". 

T2: 
Each listener in the set "resourceListeners" is notified of the event. 
One of these listeners happens to be an instance of ResourceChangeUpdaterProvider.UIUpdaterProvider. 
The method ResourceChangerUpdaterProvider.UIUpdaterProvider.doProcessDelta gets invoked: 
                protected void doProcessDelta(IResourceDelta _delta) 
                { 
                        this.delta = _delta; 
                        
                        if (asynOperation) 
                                org.eclipse.swt.widgets.Display.getDefault().asyncExec(this); 
                        else 
=>                                org.eclipse.swt.widgets.Display.getDefault().syncExec(this); 
                } 
In this case asynOperation==false, so Display.getDefault().syncExec(this) is invoked. 
As a result, T2 is blocked until T1 executes the runnable. But T1 is waiting for T2 to release the locked synchronized set. 

Resolution: 
The synchronized set was locked while iterating through it to prevent another thread from modifying the set while iterating. 
But holding a lock on this set is not necessary while every listener processes the event. 
So a local copy of the listeners set is created (in a synchronized block), then the listeners are notified outside the synchronized block: 
        public void resourceChanged(IResourceChangeEvent event) { 
                IResourceDelta delta = event.getDelta(); 
            if(delta == null) return; 
            //- first delegates to other listeners registered in the cache. 
            IResourceChangeListener[] copy; 
            synchronized(resourceListeners) { 
                        copy = (IResourceChangeListener[]) resourceListeners.toArray(new IResourceChangeListener[resourceListeners.size()]); 
            } 
            for (int i = 0; i < copy.length; i++) { 
                try { 
                    copy[i].resourceChanged(event); 
                } catch (Throwable e) { 
                    UiPlugin.logError(e); 
                }                 
            } 
        } 

Note: It is not possible to replace syncExec() with asyncExec() in this case. 
Comment 15 Paul Slauenwhite CLA 2006-11-17 12:40:45 EST
Pached-checked in to CVS (HEAD).

Requesting for Hubert to patch the TPTP-4.3.0-200611160100 candidate driver.
Comment 16 Paul Slauenwhite CLA 2006-11-17 14:35:42 EST
Will be packaged with the fix from https://www.eclipse.org/bugs/show_bug.cgi?id=164988.
Comment 17 Paul Slauenwhite CLA 2006-11-17 14:38:42 EST
File afffected:

/cvsroot/tptp/test/org.eclipse.hyades.test.ui/src/org/eclipse/hyades/test/ui/internal/navigator/proxy/FileProxyNodeCache.java
Comment 18 Bob Duncan CLA 2006-11-22 17:24:04 EST
(In reply to comment #11)
> ... Although we could not reproduce/verify Bob's scenario (and he is
> equally busy with unrelated 4.3 deliverables so he could not test Julien's
> fix), Julien's fix does resolve both Sheldon and Ali's issue ...

My scenario was identical to Sheldon's scenario. I just verified my scenario against the patched driver TPTP-4.3.0-200611160100A and my scenario is fixed by it.
Comment 19 Julien Canches CLA 2006-11-23 12:00:49 EST
Bug 165682 has been created as a follow-up, to address the excessive locking performed by the other piece of code involved in the deadlock.
Comment 20 Paul Slauenwhite CLA 2007-06-02 14:50:57 EDT
Reporter: Please verify and close in preparation for shutting down the TPTP 4.4 release.  Thanks.
Comment 21 Paul Slauenwhite CLA 2007-06-27 13:18:25 EDT
Closing by default since not verified/closed by the originator in the 6+ months since being resolved.  

Please reopen if the issue is still present in the latest TPTP release or the
resolution is not correct.