| Summary: | ModelManagerImpl's yieldRule() use still susceptible to deadlock | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [WebTools] WTP Source Editing | Reporter: | Min Idzelis <min123> | ||||||
| Component: | wst.sse | Assignee: | Min Idzelis <min123> | ||||||
| Status: | RESOLVED FIXED | QA Contact: | Nitin Dahyabhai <thatnitind> | ||||||
| Severity: | critical | ||||||||
| Priority: | P3 | CC: | cameron.bateman, david_williams, ian.trimble, nsand.dev, raghunathan.srinivasan | ||||||
| Version: | 3.2 | Flags: | nsand.dev:
review+
thatnitind: review+ |
||||||
| Target Milestone: | 3.2 M7 | ||||||||
| Hardware: | PC | ||||||||
| OS: | Windows XP | ||||||||
| Whiteboard: | |||||||||
| Attachments: |
|
||||||||
|
Description
Min Idzelis
Created attachment 163842 [details]
Patch that uses ILocks to avoid deadlock
Apologies in advance for not keeping up with all the technical advances here (and doubt that I could even if I tried) but a few questions some to mind ... 1. Does the ILock prevention of deadlock only work if Display is involved? I seem to recall that Display could reorder events to get out of deadlock, and wonder if that is the key to it working here? Is that what you mean by "ILocks have automatic deadlock avoidance"? 2. If so, does the deadlock prevention reorder events? If so, isn't there a chance of changing expected outcome? (of order was not what was expected). Again, I'm just asking for my own education, not questioning the solution (since I haven't studied it and doubt I could evaluate it fairly, even if I did). Much thanks, (In reply to comment #2) > Apologies in advance for not keeping up with all the technical advances here > (and doubt that I could even if I tried) but a few questions some to mind ... > > 1. Does the ILock prevention of deadlock only work if Display is involved? I > seem to recall that Display could reorder events to get out of deadlock, and > wonder if that is the key to it working here? Is that what you mean by "ILocks > have automatic deadlock avoidance"? The deadlock avoidance of ILocks work outside of the Display thread, but they are UI aware. By default, RCP apps (like Eclipse) use special UI-aware lock handlers: UILockListener and UISynchronizer. These help avoid deadlocks for syncExecs. In this environment, attempts to do lock.acquire() on the UI thread will NOT cause the event loop to spin, but will cause the Display thread to service pending syncExec()'s (only). (Waiting beginRule's on the UI thread do the same.) Because it is just like any other ILock, using the ILock in the display thread will avoid deadlocks there as well. The Javadocs for ILock say it pretty good. (Except they don't mention that ILocks are also ISchedulingRule aware and consider them as unsuspendable locks) Locks are capable of detecting and recovering from programming errors that cause circular waiting deadlocks. When a deadlock between two or more ILock instances is detected, detailed debugging information is printed to the log file. The locks will then automatically recover from the deadlock by employing a release and wait strategy. One thread will lose control of the locks it owns, thus breaking the deadlock and allowing other threads to proceed. Once that thread's locks are all available, it will be given exclusive access to all its locks and allowed to proceed. A thread can only lose locks while it is waiting on an acquire() call. > > 2. If so, does the deadlock prevention reorder events? If so, isn't there a > chance of changing expected outcome? (of order was not what was expected). It doesn't re-order events per say, but it does re-order the ILocks. Because of the deadlock avoidance of ILocks, you should be aware of the 'gotchas' that can be caused... i.e. if a ILock you thought you had acquired for yourself could be passed to another thread. It can then re-enteres a code path that acquires the first ILock, the ILock.acquire() will be successful - perhaps unexpectantly? So, ILock is a misnomer. It's really truely a lock when there is no deadlock. If there is, the deadlock avoider will shuffle it around. This weirdness has several mitigations. Your lock(s) could only be given up if you are in the process of doing an acquire() (on a different lock). So, if you only ever use a single ILock, you shouldn't feel it at all. If you use more than a single ILock you should be more careful. Usage note: the ILock is not used as 'lock' in SharedObject - which is why it works ;-) There is a flag that says whether the SharedObject is done loading or not. The flag is always set before the ILock is released. That means if the waiting SharedObject successfully acquires that ILock and notices that initialization is still going on - it knows that it's ILock was shuffled around. Since in this situation a deadlock would result (because the wait routine will spin while initialization is going on...) and there is nothing we can do to avoid it (without crazy code restructing and tons of new API) we through a cancelation exception to indicate that we aborted a circular load. Note: this is a theoretical situation. The only way this can happen is a situation where there is a circular dependency between models. (Which smells a bit funny to me....) i.e. 1) Loading ModelA depends on ModelB 2) Loading ModelB depends on ModelA Disregarding the necessity to prevent infinite recursion by having the secondary model to know not to load the first if its already started loading... there can be a deadlock here... If two threads happen to try to do both of these at the same time, you'd hit that cyclical case - and one of the secondary models will abort with an OperationCanceledException. > Again, I'm just asking for my own education, not questioning the solution > (since I haven't studied it and doubt I could evaluate it fairly, even if I > did). > > Much thanks, Notwithstanding the usage of the new jobs API we still seem to be at a less ideal solution than we had for 3.1.x. In essence, you have basically the same solution, but without any ability to configure the timeout. The net affect to Oracle's adopter product is as follows: 1) JUnit tests that work fine in 3.1.x deadlock with 3.2I. 2) With this patch applied to 3.2 HEAD all of our model manager reliant test cases fail due to too-early deadlock timeout. It is incredibly frustrating given the amount of effort we went to in 3.1.x to find a solution that met your requirements only find out that after ignoring all that work in 3.2 you have come back to basically the same solution, only we are now a broken adopter. I am raising this to critical because as an adopter we will be completely broken by this and previous changes if it ships in this form. (In reply to comment #4) > Notwithstanding the usage of the new jobs API we still seem to be at a less > ideal solution than we had for 3.1.x. In essence, you have basically the same > solution, but without any ability to configure the timeout. The net affect to > Oracle's adopter product is as follows: > > 1) JUnit tests that work fine in 3.1.x deadlock with 3.2I. > 2) With this patch applied to 3.2 HEAD all of our model manager reliant test > cases fail due to too-early deadlock timeout. > > It is incredibly frustrating given the amount of effort we went to in 3.1.x to > find a solution that met your requirements only find out that after ignoring > all that work in 3.2 you have come back to basically the same solution, only we > are now a broken adopter. > > I am raising this to critical because as an adopter we will be completely > broken by this and previous changes if it ships in this form. Our goal is not to cause any frustration but to find the optimal technical solution to this problem. Please attach a code snippet (or JUnit TestCase) that shows the deadlock you are seeing. It should be possible to create a mock environment that reproduces the deadlock you are seeing. Then, by fixing the provided TestCase we can be sure that our solution also addresses your concerns. We will then incorporate this test case into the SSE JUnit tests to make sure that you won't be broken in the future. (In reply to comment #5) It is too late in the Helios release cycle to be making these changes since we will not have time to test them in the adopter product. The solution that exists in the 3.1.2 release has been tested in the adopter product and works reasonably well. Hence I suggest we revert to the solution implemented in the 3.1.2 for Helios and then work on a long-term solution in the next release. We'll redo things so that the timeout option is reinstated and available, but the default behavior going forward will be some combination of yieldRule() and ILock-based deadlock avoidance. As Min and I were talking this over, the question came of of whether you'd be averse to us throwing an OperationCanceledException when the timeout expires instead of simply returning null. Null's an incorrect return value since it can mean either an unsupported file type or this timeout. You can't really tell which circumstance happened, and this unpredictability could lead to mysterious problems down the road. Not that we're fixated on throwing the exception, but we're curious if you'd be ok with that change in that code path for the sake of avoiding "random funkiness" down the road. Created attachment 165769 [details]
ILock + timeout preference patch
(In reply to comment #5) > (In reply to comment #4) > ... Please attach a code snippet (or JUnit TestCase) that > shows the deadlock you are seeing. ... While I know the previous behavior is being restored (optionally) I'd like to emphasize how important it is to get to the bottom of the problem. Important for everyone ... so, short of a test case, wouldn't it be helpful to even have a stack dump taken once the hang occurs? Or is that too difficult to interpret without full source? I was just thinking if it "occurs frequently" seems a stack dump should be easy to get. And thanks Min, Nitin, everyone, for continuing to work on this hard problem. >In reply to comment #8)
> Created an attachment (id=165769) [details]
> ILock + timeout preference patch
This looks good from our perspective. We will test further in the M7 and try to raise any issues that arise quickly. Thanks for your responsiveness on this.
> While I know the previous behavior is being restored (optionally) I'd like to
> emphasize how important it is to get to the bottom of the problem. Important
The root of the problem is ultimately that a client thread may be asked to do a large amount of initialization. Locks may be held on entry, the ModelManagerImpl itself takes and releases locks and also calls into arbitrary code (depending on the content type of the file being loaded it could include JDT for example) while holding these locks. This make it basically impossible to control lock ordering in an open system with many possible unknown (adopters) callers.
Based on similar problems we have in JSF, I have come to believe that the long term solution is change the way the model manager is used. Instead of having calls be synchronous and "block until completion" from the caller's perspective, we need to move to more of a rendezvous type system.
The way ModelProvider handles model modifications offers a hint of how to do this: all clients wrap their model handling logic in some type of runnable that is queued to the model manager for execution. When a request is queued, a separate Job is scheduled to ensure that the particular model is initialized. If it is not, it initializes it. Once initialization is complete, queued runnables are executed. If the model is already loaded, then runnables may be able to be executed immediately.
This creates the following benefits:
1) Client threads never block waiting for a model: they queue the logic needed and continue on. They may choose to block of their own accord, but that is their choice.
2) The lock ordering for the thread that loads a shared model is predicable, since we create the job and know what order it takes locks, at least until it calls into FileBufferManager.
3) All client thread follow the same execution path to the model and there should be far few locks needed.
Looks fine to me. Some minor changes I would make is to change SharedObject.id from volatile to final and remove the commented out blocks of code from PrefUtil. Please let us know when this patch will be available in the WTP build for us to test. Released to the build. (In reply to comment #13) > Please let us know when this patch will be available in the WTP build for us to > test. org.eclipse.wst.sse.core=v201004262353 (and newer) |