Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 329488

Summary: DSF ACPM deadlocks
Product: [Tools] CDT Reporter: John Cortell <john.cortell>
Component: cdt-debug-dsfAssignee: Pawel Piech <pawel.1.piech>
Status: RESOLVED FIXED QA Contact: Pawel Piech <pawel.1.piech>
Severity: normal    
Priority: P3 CC: cdtdoug
Version: 8.0Flags: pawel.1.piech: review? (john.cortell)
Target Milestone: 8.0   
Hardware: PC   
OS: Windows XP   
Whiteboard:
Attachments:
Description Flags
stack crawls showing deadlock
none
stack crawls showing deadlock (second case)
none
A different fix.
pawel.1.piech: iplog-
Additional fix in RM.
pawel.1.piech: iplog-
Patch to call cancelled(). pawel.1.piech: iplog-

Description John Cortell CLA 2010-11-04 16:17:51 EDT
The cache range unit tests experience intermittent deadlocks. I can reproduce them 100% of the time by adding a Thread.sleep(100) at the beginning of RequestCache.canceled(). I was hopeful at first; it seemed a simple adjustment to RequestCache.canceled() would fix the problem--I made it unsynchronized and tweaked the implementation to:

    	RequestMonitor rm = fRm;
    	if (rm != null) {
            rm.cancel(); 
        } 

That fixed that particular deadlock (cache_deadlock.gif), but a second, similar one was close behind (cache_deadlock2.gif). Both are the result of threads locking the RM and cache object in opposing order.
Comment 1 John Cortell CLA 2010-11-04 16:18:15 EDT
Created attachment 182425 [details]
stack crawls showing deadlock
Comment 2 John Cortell CLA 2010-11-04 16:18:31 EDT
Created attachment 182426 [details]
stack crawls showing deadlock (second case)
Comment 3 Pawel Piech CLA 2010-11-09 01:19:05 EST
Created attachment 182681 [details]
A different fix.

At first I thought of the change in RequestMonitor.canceled() also.  Another option is to force handleCanceledRm() to always be called on the DSF executor thread... as in this patch.

There is still a problem though.  I think canceled() should be called from isCanceled() when a cancellation is detected, but I think I couldn't find a way to do that safely.  The cancellation handling is particularly difficult because it needs to be thread safe (since RequestMonitor.isCanceled()) is thread safe, but the rest of the cache logic is synchronized using the executor thread.
Comment 4 John Cortell CLA 2010-11-09 09:19:38 EST
(In reply to comment #3)
> The cancellation handling is particularly difficult because
> it needs to be thread safe (since RequestMonitor.isCanceled()) is thread safe,
> but the rest of the cache logic is synchronized using the executor thread.

Indeed. I couldn't think of a good solution, either. But since we're talking about changing the cancel behavior, maybe we should first sort that out and then see if this is still an issue.
Comment 5 Pawel Piech CLA 2010-11-10 19:16:13 EST
Created attachment 182862 [details]
Additional fix in RM.

The previous fix was committed with bug 329481.  This fix addresses another issue that contributed to this deadlock: The whole RequestMonitor.isCancel() method is synchronized and it calls the parent RM's isCanceled() method while holding the lock.  This is unnecessary and in this case it contributed to a deadlock.
Comment 6 Pawel Piech CLA 2010-11-10 19:29:43 EST
(In reply to comment #3)
> There is still a problem though.  I think canceled() should be called from
> isCanceled() when a cancellation is detected, but I think I couldn't find a way
> to do that safely.  The cancellation handling is particularly difficult because
> it needs to be thread safe (since RequestMonitor.isCanceled()) is thread safe,
> but the rest of the cache logic is synchronized using the executor thread.

I still have to think about this one a little more.
Comment 7 Pawel Piech CLA 2010-11-11 13:04:39 EST
Created attachment 182925 [details]
Patch to call cancelled().

(In reply to comment #3)
> There is still a problem though.  I think canceled() should be called from
> isCanceled() when a cancellation is detected, but I think I couldn't find a way
> to do that safely.  The cancellation handling is particularly difficult because
> it needs to be thread safe (since RequestMonitor.isCanceled()) is thread safe,
> but the rest of the cache logic is synchronized using the executor thread.

This patch extends on the last one and changes the logic in isCanceled() so that it does not modify the list of waiting requests.  Instead it uses a separate dispatch call to handle the canceled RMs and to call AbstractCache.canceled().
Comment 8 Pawel Piech CLA 2010-11-11 13:05:46 EST
I committed the fixes, John please review.
Comment 10 John Cortell CLA 2010-11-11 16:18:06 EST
1. Very minor. We can avoid the local variable in RequestMonitor.isCanceled() with

    synchronized {
        if (fCanceled) {
             return true;
        }
    }


2. This is not new behavior, but it seems to me AbstractCache.isCanceled() returning 'true' when fWaitingList is null is either incorrect or the method is poorly named. fWaitingList  is nulled after a call to set(). Someone calling isCanceled() after that would get an incorrect response--at least a misleading one. If the current behavior is intended, then renaming the method to isBeingWaitedOn() would reduce confusion, IMO.


3. Asynchronously calling handleCanceledRm() opens the door for isCanceled() being called again and re-requesting the removal of the same set of canceled RMs. It appears to be harmless, but it is confusing. There should be a comment acknowledging that possibility.

4. The new local TestCache() instance in CacheTests.cancelWhilePendingWithoutClientNotificationTest() should probably call super.canceled() for thoroughness.