Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 310396 - [implementation] Deadlock in SharedASTProvider.getAST()
Summary: [implementation] Deadlock in SharedASTProvider.getAST()
Status: VERIFIED FIXED
Alias: None
Product: JDT
Classification: Eclipse Project
Component: Text (show other bugs)
Version: 3.6   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: 3.6 RC1   Edit
Assignee: Dani Megert CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-04-25 17:43 EDT by Martin Oberhuber CLA
Modified: 2010-05-17 09:13 EDT (History)
3 users (show)

See Also:
markus.kell.r: review+


Attachments
Thread dump of deadlocked eclipse (27.01 KB, text/plain)
2010-04-25 17:43 EDT, Martin Oberhuber CLA
no flags Details
Fix (866 bytes, patch)
2010-05-11 10:11 EDT, Dani Megert CLA
no flags Details | Diff
Fix (886 bytes, patch)
2010-05-12 10:15 EDT, Dani Megert CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Oberhuber CLA 2010-04-25 17:43:38 EDT
Created attachment 166031 [details]
Thread dump of deadlocked eclipse

Build ID: I20100422-1310

I was editing a Java file (Project.java from o.e.core.internal.resources).
I think that I tried scrolling the editor right when Eclipse deadlocked.

Thread dump attached.
Comment 1 Martin Oberhuber CLA 2010-04-25 17:52:24 EDT
Thread dump seems to indicate that Threads

  main
  Worker-48
  Worker-50
  Text Viewer Hover Presenter

all try to access the shared AST, and are all locked in its object monitor. At the same time,

  Worker-49
  Worker-48
  Worker-41
  Worker-40

hangs in SelectionListenerWithASTManager, and a couple of JavaReconciler Threads are also waiting...
Comment 2 Markus Keller CLA 2010-04-25 18:49:50 EDT
To be investigated for M7.
Comment 3 Remy Suen CLA 2010-04-25 19:02:37 EDT
I've also had my Eclipse hang these past couple of days. There's also bug 302581 which I presume is related.
Comment 4 Dani Megert CLA 2010-04-28 09:33:56 EDT
>To be investigated for M7.
I take this as a +1 for RC1. I need to look again whether the proposed fix from bug 302581 really was OK but currently I'm on a business trip.
Comment 5 Dani Megert CLA 2010-05-11 10:11:16 EDT
John, can you please review the patch?
Comment 6 Dani Megert CLA 2010-05-11 10:11:44 EDT
Created attachment 167926 [details]
Fix
Comment 7 John Arthorne CLA 2010-05-11 22:33:47 EDT
I don't understand why this patch fixes the problem. It looks like the problem is that threads are waiting on reconciliation, but no reconcile is happening. For this to happen it means either there was no notification, or the notification happened before the isReconciling() condition changed. Since the fIsReconciling field is already volatile, adding synchronization around assigning that field shouldn't make any difference.

I don't know this code very well, but I noticed there is one code path at the top of the cached() method where there is no notification and it just returns. I can't clearly see how this leads to the deadlock, but the code is complicated enough that maybe it is possible. Ideally if any state changes that affects the outcome of isReconciling() there should be a notification within the same synchronized block.
Comment 8 Dani Megert CLA 2010-05-12 10:14:34 EDT
>I don't understand why this patch fixes the problem. 
I thought that a possible scenario for the deadlock could be that for some reason the progress monitor gets canceled (fIsReconciling -> true) while getAST(...) is already inside the fWaitLock. The patch should have closed that hole - however, I noticed that this would not fix that scenario because if the thread dies then it won't be able to notify it later.

After looking at the code and trying to reproduce the deadlock I found the scenario and the fix for the deadlock.

Test Case:
 0  start dev/host workspace
 1. add the following breakpoints - but disable all at this point
    ASTProvider, line 318
    ASTProvider, line 440
    ASTProvider, line 452
    ASTProvider, line 599
 2. start target workspace that has a Java project and two CUs (C1, C2)
 3. disable 'Mark Occurrences' in the target
 4. paste this into Package Explorer:
import java.util.HashMap;;
public class C1 {HashMap m} 
public class C2 {}
 5. switch/activated editor for C1
 6. enable bpt 440
 7. switch/activated editor for C2 ==> hit bpt 440
 8. disable bpt 440
 9. switch to C1
10. enable bpt 599 and 452
11. type a space in C1 ==> hit bpt 599
12. select HashMap m; and press Ctrl+C  ==> hit bpt 452
13. resume all threads except main and the one that waits in bpt 440
14. enable bpt 318
15. resume the thread that waits in bpt 440
16. resume the main thread
17. resume all other waiting threads
==> deadlock


The problem is that we call isReconciling(...) without synchronizing it with fReconcileLock and unfortunately we cannot add this inside the fWaitLock without introducing yet another deadlock. There are two possible fixes:
1) We synchronized the synchronized (fReconcileLock) {...} in 
  aboutToBeReconciled(...) with the fWaitLock. This is risky.
2) We switch the two statements in the synchronized block in 
  aboutToBeReconciled(...). This works because the important condition is
  is changed first and hence it will not wait inside getAST(...). That the
  fReconciling flag is set to true afterwards doesn't harm.

Since we already have RC1 and introducing another lock is always risky, I will fix this using 2).

I went through the fix with Markus and he agrees on that. John, can you agree with that as well?
Comment 9 Dani Megert CLA 2010-05-12 10:15:21 EDT
Created attachment 168146 [details]
Fix
Comment 10 Markus Keller CLA 2010-05-12 12:59:57 EDT
+1 for RC1.

Reproduced the deadlock with given steps, except for
> 3. disable 'Mark Occurrences' in the target
where I had to *enable* Mark Occurrences. Verified that the fix solves this issue.
Comment 11 Dani Megert CLA 2010-05-12 13:10:24 EDT
Patch committed to HEAD.
Comment 12 Markus Keller CLA 2010-05-17 08:55:05 EDT
Starting verification...
Comment 13 Markus Keller CLA 2010-05-17 09:13:32 EDT
Verified in I20100516-0800 that the scenario from comment 8 can't happen any more and that the dump from comment 0 likely ran into this problem.