| Summary: | [hotbug] ModelManagerImpl is highly susceptible to deadlock | ||
|---|---|---|---|
| Product: | [WebTools] WTP Source Editing | Reporter: | Cameron Bateman <cameron.bateman> |
| Component: | wst.sse | Assignee: | Min Idzelis <min123> |
| Status: | RESOLVED FIXED | QA Contact: | Nitin Dahyabhai <thatnitind> |
| Severity: | blocker | ||
| Priority: | P2 | CC: | Andrew.McCulloch, brian.vosburgh, carlin.rogers, david_williams, fziglar, konstantin, min123, neil.hauge, nsand.dev, raghunathan.srinivasan |
| Version: | 3.1.1 | Flags: | raghunathan.srinivasan:
pmc_approved+
thatnitind: pmc_approved? (naci.dai) thatnitind: pmc_approved? (deboer) neil.hauge: pmc_approved+ thatnitind: pmc_approved? (kaloyan) david_williams: pmc_approved+ |
| Target Milestone: | 3.1.1 | ||
| Hardware: | All | ||
| OS: | All | ||
| URL: | ORACLE_P1 | ||
| See Also: | https://bugs.eclipse.org/bugs/show_bug.cgi?id=306757 | ||
| Whiteboard: | PMC_approved | ||
| Bug Depends on: | 283446, 283449 | ||
| Bug Blocks: | |||
| Attachments: | |||
|
Description
Cameron Bateman
Created attachment 144990 [details]
A first step toward avoiding the deadlock (does not fix the root issue)
This patch is a proposed workaround for the deadlock problem. It changes the way that SharedObject.waitForLoadAttempt and setLoaded methods work so that:
1) It never tries to acquire the SharedObject monitor. Instead, it relies on a private lock that can only be acquired from with the class (technically, SharedObject should be pulled out of ModelManagerImpl to ensure this, but I'm trying to keep the patch simple).
2) The wait for load algorithm is modified so that it is both interruptible and has a maximum wait time. If the thread is interrupted or the timeout expires before the model is loaded, a runtime exception is thrown to try and unwind any locks that were taken.
3) The initializing and waitFor flags were made volatile. In the simple single read and write scenarios, this allows us have thread-safe memory consistent access without acquiring a lock.
4) Calls to waitForLoadAttempt are no longer made while the SharedObject lock is held.
We would like this bug to be resolved for WTP 3.1.1 release. The adopter product has an upcoming release based on WTP 3.1.1 and this bug is a blocker. Hot bug request details: Affliation: Oracle Target Release: WTP 3.1.1 This is a problem we've been aware of and it is very difficult to solve. (I'll get into the details later.) But before I go on, I would like to suggest some things you could try on your side without requiring changes to ModelManagerImpl. From the stack traces, and your description, it seems that Worker-7/VariablesController$DocumentRediscoveryJob is running within a Project scheduling rule. Would it be possibly to modify this Job to run with *no* (null) rule? Then, if you would call he getModel* APIs with no-rule, you would not prevent the other thread from completing the load() - avoiding deadlock. Remember, you can start a rule within a Job using Job.getJobManager().beginRule( project, monitor ). Looking over the patch, it seems like it is an API-breaking change. It may be binary-compatible but existing clients are not expecting this exception. Also, 3s is an arbitrary number: for some large models on slower machines, it could take 3 seconds or more to load. Lastly, the code in ModelManagerImpl itself doesn't tolerate exceptions well. Many (all?) of the lock.acquire() and release() methods are not found in try-finally blocks. An exception may cause the locks to not be properly released. (The wrap locks in try-finally should be a separate issue from this - but it may be aggravated by this patch.) The **real** way to fix this depends on bug 283449. I'll describe it quickly. Lets consider the same scenario. Your Job+ProjectRule is running (so it has the rule). Another thread is in process of loading the model you want in a separate thread, and is blocked on starting a workspace rule. **If** we were allowed to "transfer" the ProjectRule from your Job to the thread loading the model, that model loading thread will become "unblocked" and will complete the load. We could do this transparently to the client. Your Job will call getModel() which will perform the rule transfer behind the scenes. Then after the load attempt is finished, it will transfer the rule back to the caller thread (the Job). This solution would be transparent to the client. Please let me know if the workaround I mentioned earlier will work for you. If not, we'll see if we can come up with something else until we can get a proper fix. > scheduling rule. Would it be possibly to modify this Job to run with *no* > (null) rule? Then, if you would call he getModel* APIs with no-rule, you No this is not an option. Not only are we relying on rule scheduling to ensure integrity, but there are plenty of other threads we don't control such the ValidationJob, J2EE classpath reconciler job and Js Indexing thread that are project scheduled and call the model manager. > Remember, you can start a rule within a Job using Job.getJobManager > ().beginRule( project, monitor ). I'm not sure how that helps me when my main objective is to place a scheduling lock on the file that I am trying load the model for (actually for everything in the project since I'm typically responding to a resource change event). > Looking over the patch, it seems like it is an API-breaking change. It may be > binary-compatible but existing clients are not expecting this exception. It's not a checked exception, so I don't see the problem. It will behave the same as if an NPE was thrown. We can make it an Error if you are worried that somebody is catching RuntimeException's. > 3s is an arbitrary number: for some large models on slower machines, it could Yes. I'm happy to go higher; this just makes it easier to see it happen in my tests. > Lastly, the code in ModelManagerImpl itself > doesn't tolerate exceptions well. Many (all?) of the lock.acquire() and > release() methods are not found in try-finally blocks. The waitForLoadAttempt is never called while holding this lock, but I'm happy to add the try-finally's that should be there anyway. > The **real** way to fix this depends on bug 283449. In this case, I think we can solve the problem by not holding the sharedObject lock when calling out to the file buffer manager or by imposing lock ordering by starting a rule on the file being loaded before acquiring the sharedObject lock. I am looking at implementing a patch for that as well. This first patch, seemed like a lower risk solution to propose first since it seems to help us break the deadlock. I agree it's not ideal, but the current situation is even less ideal. (In reply to comment #5) > I'm not sure how that helps me when my main objective is to place a scheduling > lock on the file that I am trying load the model for (actually for everything > in the project since I'm typically responding to a resource change event). The IResource APIs will automatically acquire locks as needed, is it that you need to lock multiple files at the same time? > > Looking over the patch, it seems like it is an API-breaking change. It may be > > binary-compatible but existing clients are not expecting this exception. > > It's not a checked exception, so I don't see the problem. It will behave the > same as if an NPE was thrown. We can make it an Error if you are worried that > somebody is catching RuntimeException's. > > 3s is an arbitrary number: for some large models on slower machines, it could > > Yes. I'm happy to go higher; this just makes it easier to see it happen in my > tests. Not being a checked exception IS the problem. While it may be source compatible, the change in behavior is not compatible with code designed for 3.1.0 that expected to run to completion or not at all. > The IResource APIs will automatically acquire locks as needed, is it that you > need to lock multiple files at the same time? Just like the validation and J2EE class path reconciler jobs, we are effectively reindexing a project's contents each time there is a resource change event that is of interest. We want to be sure that the project is not deleted, re-built or it's contents changed while we are reindexing. Even if we run with no rule, the problem will still occur because the other jobs like JavaScript Indexer and the Dependency Graph job run with the rule locks taken. > Not being a checked exception IS the problem. While it may be source > compatible, the change in behavior is not compatible with code designed for > 3.1.0 that expected to run to completion or not at all. I suppose we could catch the exception within ModelManagerImpl and behave as though no model exists (return null). Given that we are trying to detect a critical failure, a runtime exception seems like a far more appropriate response. (In reply to comment #5) > > scheduling rule. Would it be possibly to modify this Job to run with *no* > > (null) rule? Then, if you would call he getModel* APIs with no-rule, you > > No this is not an option. Not only are we relying on rule scheduling to ensure > integrity, but there are plenty of other threads we don't control such the > ValidationJob, J2EE classpath reconciler job and Js Indexing thread that are > project scheduled and call the model manager. > I'm guessing here (haven't studies the traces or implementation) ... but does the code that starts the VariablesController$DocumentRediscoveryJob actually know what jsp document model(s) it needs, before it starts the job? If so, you might be able get the model(s) first, and then start your job. (and then release models once job is done). Once it's "cached" and in the Model Manger, there'd be much less calling out to foreign code (such as to JDT). And, none of us would argue that the ModelManager is correct ... but, given it is hard or dangerous to change thread behavior in a maintenance release ... I have this distant memory where such problems have been solved in the past (by adopters) by them moving the (first) "getModel" up higher in the code chain, where it can be called with less deadlock prone rules. Certainly doesn't solve all issues ... just mentioning it in case it helps in your case. (In reply to comment #7) > ... we are > effectively reindexing a project's contents each time there is a resource > change event that is of interest. Are you starting the work/job in the resourceChanged code? That's known to be deadlock prone ... even though we unfortunately do it all over the place. If it helps, the recommended pattern is to "set a flag" such as "reindex needed" and then next time you need that index information, you'd know to do all the reindexing you need. Maybe that's what you are already are doing ... just trying to help. > Are you starting the work/job in the resourceChanged code? That's known to
> be deadlock prone ... even though we unfortunately do it all over the place.
To clarify on this point. Starting a job that then executes on a different thread is fine from resourceChanged code. What one should never do is to run code from the resource change listener callback that might try to acquire a lock that can also be held while touching any part of the resource tree (even if you aren't modifying it). This is far too easy to do. The natural tendency when you have data structures that are backed by resources is to first lock the data structure and then to try to access the resources. Bad idea. You are just asking for deadlocks. I had a lot of fun with that in the faceted project code. You can take a look at FacetedProject class to get an idea of what it takes to get this locking deadlock-safe. It's not pretty to say the least.
> Are you starting the work/job in the resourceChanged code? That's known to be
No, we launch a WorkspaceJob from the resourceChanged method to handle things in the background.
> the code that starts the VariablesController$DocumentRediscoveryJob actually
> know what jsp document model(s) it needs, before it starts the job? If so,
> you might be able get the model(s) first, and then start your job. (and then
We do know, but this would be the same as running on a null rule: if our thread enters the ModelManagerImpl first and a second thread with a resource lock enters after, but before model finishes loading, then our thread will hold SharedObject while the second one waits for it and we will seek (via the text buffer manager) the resource lock while holding SharedObject.
Also, passing a shared instance between threads seems a little dangerous, given we don't know when (if ever) that thread will run and so might never dispose the object.
We'll talk more about this during the teleconference we have scheduled on Monday. However, I would like to talk about some of the technical details and limitations that we're faced with. First off, let me explain what has to happen when you call getModel* APIs on ModelManager. Before I start, some terminology. When I talk about locks, I'm referring to any type of Scheduling Rule that corresponds to any IResource. In the past, there had been one "master" lock. This corresponds to the WorkspaceRoot, and effectively "locks" the entire workspace for modifications. This is known as the workspace lock. Starting in Eclipse 3.0, the JobManager was invented and the workspace lock was broken down into finer-grained, hierarchical "locks." In fact, IResource extents ISchedulingRule. Although you can create your own rules, it was really intended that you just use the IResources as is. In fact, (as of Eclipse 3.2) to be absolutely correct, you should always use the IResourceRuleFactory instead of using the resources directly to get the appropriate lock. (ResourceMappings allow two "logically" related resources to be treated atomically). Also, note that although these locks are fine-grained and hierarchical, they are still mutual exclusion locks (mutexes). There is no such thing as "read" locks. You do not need a rule to read a file. All scheduling rule locks are considered "write" locks. The closest thing to a read lock is null. But since there are no real read locks, there is no way to prevent a resource from being changed (or deleted) while you are reading the file. You either need to write your code to tolerate this or suffer the performance degradation of locking the resource using the mutex lock. I say degradation because if you are strictly a read-only op (like an indexer) that is long running, you'll prevent any other writes from going on in the workspace which can hold up things like saving a file, etc. If Eclipse had real read locks, we could had implement reader/writer locks that allow multiple readers (no writers), or 1 writer at a time. This is not the case. Now getting back to the issue at hand: It was intented that you do not should not be *required* to hold any lock when you call into getModel* on ModelManager. It was also indented to not *prevent* you from holding a lock when you call getModel* Unfortunately, by trying to accurately obey the JSP spec, and the implementation of JDT and the Eclipse Platform -- this in *some cases* breaks down. (This bug being one.) As it turns out, some code paths deep in the implementation of getModel* actually *need* to acquire a Workspace Root (the master) lock. As of the JSP 1.2 spec, the structure of the resulting DOM of a JSP that contains custom tags is *dependent* on on TLD for the custom tag's <body-content> definition. If it is "tagdependent" it can contain things like embedded SQL and should not be parsed as regular DOM nodes. So for JSPs that have custom tags, we need to parse the TLD in order to parse the JSP. The spec allowes for TLDs to live within any .jar file in the web-apps classpath. That means we need to get the classpath. It turns out that asking JDT to resolve a classpath (JavaCore#getClasspathContainer()) *may* require the workspace root. That code and conditions under which it happens is hairy, but we already asked, and JDT must use the workspace root here because during its resolve classpath because it might update the IProject's description. (Specifically the project/references/"build order" property.) This ultimately boils down to the IResource framework itself. (See bug 241751 for all the gory details) This means there's an additional restriction: even if you did NOT hit this deadlock, there are (rare) occasions where calling into getModel* with any lock besides workspace root will cause a runtime exception. I.E. If you are holding a project/file lock while calling into getModel* and it happens that JDT isn't fully initialized (early startup, workbench restore with active editors, etc) you may get that infamous "Attempted to beginRule: /R, does not match outer scope rule: P/Project". ** You could say that (currently) calling getModel* while holding any lock other than WorkspaceRoot (or null) is NOT supported due to this limitation. ** Actually, that's not even completely true. Thats only when you are performing getModel* in isolation (without other threads wanting to do the same.) If multiple threads are doing the same - you have the additional restriction that ALL threads must hold the workspace lock or NONE of the thread must hold the lock. Why? * If every thread that accessed getModel* was running with the workspace lock, only one could run at a time. And it would allow the rare situation when JDT actually needs the workspace lock to run as well - since it would already be held. * If NONE of the threads where holding the lock, then you'd get multiple entries into getModel(). However, JDT's resolveClasspath *would* be able to acquire the workspace lock, because no other threads would be holding onto any lock (like the IProject lock) since you JDT can't resolve the classpath (start a workspace lock) when any other lock (even finer grained) is held. Obviously, neither of these situations is ideal. Since we can't impose that everybody must hold the workspace lock when calling into getModel*, we must (temporarily) impose that no locks be held when calling into getModel* - but only when multiple threads are running at once on the same file. There are things that complicate how we can solve this issue: 1) There is no reliable way to determine which rules are active: bug 283446 (though this won't solve the issue completely, it could help) 2) Need a way to transfer rules away from a job: bug 283449 (This will allow us to fix this issue appropriately, see my previous comment) Because of these very complicated issues, I think the best place to solve this issue is to work around the problem on your side. I believe my previous suggestion could work for you. Let me explain. To be clear, a Job running with a Rule (setRule()) is equivalent to a Job/Thread running that calls beginRule() (after the call to begin rule). That is: (pseudo-code for brevity) Job j = new Job() { run() { doStuff() } }; j.setRule( root ); is the same as (exluding very-subtle differences...) Job j = new Job() { run() { beginRule(root); doStuff(); endRule() } } Because calling beginRule() will "lock" that rule, I think if you should be able to modify your code so that the getModel() is performed outside of the lock object, but that you will still lock the workspace/project for writes for the remainder of your job... i.e. Job j = new Job() { run() { model = getModel(); beginRule(root); doStuff(model); endRule() } } Why will this work? When multiple threads try doing a getModel() on the same resource the same exact same DOM model must be returned. (1 model per resource). The model manager ensures that only 1 load operation per resource will occur at a time. When two threads want the same model, 1 will have to wait, and the other will perform the load. When the load is finished, the same model is passed to both callers/threads. The deadlock occurs because one of the threads contains the workspace lock that is required by the other thread to finish loading the model. The deadlock occurs when it turns out (by chance) that the thread that doesn't contain the lock started loading the model first, and then the other thread acquired the project rule - preventing the first thread from finishing. If you adjust the first thread to NOT have the project lock when calling getModel() is doesn't matter which thread starts to load the model. (In fact, you shouldn't use a project lock anyways due to other reasons mentioned above - but you could use the workspace lock). Because no one owns the workspace lock, the either thread is able to acquire the workspace lock successfully and finish loading successfully. Ok - now for our ideas on how to fix this problem: 1) By far the best approach would be bug 283449. This will allow us to temporarily pass the rule that is causing the deadlock to the place where it can be used to finish the load attempt. This also requires some thread coordination, but it's simpler than the other options. 2) We could just "be inaccurate" and ignore the TLDs when parsing JSPs. (Not acceptable). --The following 2 are ones I just thought of while writing this response, so it might be half-baked and requires research. 2) We could adjust the taglibindex.resolve() method to try to attempt-execution on all threads involved in loading the same model. This would be extremely complicated to get right, since it would involve exchanging objects between threads and lots of thread coordination. 3) Sorta similar: We could just have each thread load the same file (into different models) at the same time. (Instead of just choosing one at random like current.) Whoever finishes first will "win". The other thread will continue creating the model, but this model will be "thrown away" and replaced with the winner to maintain the API restriction that there is only 1 model per resource. (The code was never designed to do this... so there might be lurking issues). This'll probably also be very complicated to get right. == I'm confident that we can find a workaround that produces "correct" behavior. But, there could be another debate here about the RuntimeException. If we were to throw a RuntimeException it will be a breaking API change. But, philosophically it might be desirable to throw an Exception instead of get into a deadlock - which also is a disastrous result. This is a point open to debate. The RuntimeException doesn't actually fix your problem (you'd get the exception, and your job will abort.) But it will prevent deadlock which is a very good goal to have. I *might* be able to be convinced that it would be a good compromise if we increase the delay to a very long value - say 30-45 seconds, and then throw the exception to break deadlock. I'd like to hear what others think. At least, this would prevent other clients (besides yourself) from hitting this unfortunate bug until we get a proper fix for it. Since the proper fixes will take a lot of time and testing to ensure correctness. Though if we find workaround that works for you, this point may be moot. == Remember, even after this is fixed, you should not hold any rule besides null/root when calling into APIs (like project/file rules). No plans to resolve that at the moment. (In reply to comment #10) > > Are you starting the work/job in the resourceChanged code? That's known to > > be deadlock prone ... even though we unfortunately do it all over the place. > > To clarify on this point. Starting a job that then executes on a different > thread is fine from resourceChanged code. What one should never do is to run > code from the resource change listener callback that might try to acquire a > lock that can also be held while touching any part of the resource tree (even > if you aren't modifying it). This is far too easy to do. The natural tendency > when you have data structures that are backed by resources is to first lock the > data structure and then to try to access the resources. Bad idea. You are just > asking for deadlocks. I had a lot of fun with that in the faceted project code. > You can take a look at FacetedProject class to get an idea of what it takes to > get this locking deadlock-safe. It's not pretty to say the least. > More points/tips: 1) Agreed: You can start jobs in IResourceChangeListener#resourceChanged(). Just remember according to javadoc that the delta you get in the resourceChanged() method is..."is valid only for the duration of the invocation of this method." That is, if you pass this delta to your Job that you start the resource tree may have changed further. 2) Agreed: Don't lock any part of the resources tree during resourceChanged(). Actually, you can't. This is enforced via the IResource/Scheduling Rules framework. The resourceChanged() method is invoked asynchronously under the "NotifyJob" (internal job). Regardless, your resourceChange() is run UNDER the same lock (scheduling rule) as the operation that was modifying the resources. If you try to start a rule that is not compatible with this rule you'll get the "Attempt to start rule: MyNewRule, does not match outer scope rule: WorkspaceModifyContextRule" which is pretty cool. Additionally, the Workspace.isTreeLocked() method will return true and prevent workspace modifications. 3) If you are doing a read-op, it's better just to perform the read within the context of this resource change. These resourceChanges() are processed asynchronously now, so they will not hold up the rest of the operation from proceeding. It's unlikely that the resource will change during this op, since at a minimum is is run under the rule that contains that resource. **Note: getModel* is not (always) just a read op!! :-( See my previous comment. 4) If you are modifying workspaces outside of a workspacerunnable running with AVOID_UPDATE you will periodically sent out deltas during the running of your op. (Import a large existing project, and see how the navigator updates while it imports for an example of this) 5) AVOID_UPDATE is a hint... don't depend on it. Deltas can be sent out at any time. > multiple threads are doing the same - you have the additional restriction that > ALL threads must hold the workspace lock or NONE of the thread must hold the > lock. Why? But it is not imposed and therefore both internal projects and adopters don't do this. For example, the manual validation job inside WTP still runs with a project lock. So deadlocks are simply less likely to occur but still inevitable. > ** You could say that (currently) calling getModel* while holding any lock > other than WorkspaceRoot (or null) is NOT supported due to this limitation. ** Why isn't this documented on the API? Also, why not log a warning in ModelManagerImpl if: Job.getJobManager().currentJob().getRule() != WSRoot || != null ? It won't help if someone has nested a null rule inside another rule, but it will help catch almost everything else. > Ok - now for our ideas on how to fix this problem: Why not do this: 1) Don't hold the SharedObject lock while calling outside ModelManagerImpl. Only use it to achieve synchronous access to SharedObject's fields. Other threads that try to get the model while it is still loading will still block on the waitForLoadAttempt, but their wait is on a separate monitor that fails on an interrupt (so that we can set up a watchdog timer in adopter code) and ideally times out after some long period, maybe 30s or a few minutes. We could also make that configurable so that end-users can tune it if they have a particularly slow machine etc. 2) Split ModelManagerImpl into at least two implementations: one for JSPs and one for everything else. Resolve the JSP-specific problems in that new model manager implementation. Since we know that a workspace root is required for JSP loading (if you can fix that elsewhere so much the better), attempt to acquire the workspace root before calling outside the model manager impl. This will throw an exception if called with anything but a null or wsroot rule (nested or not) and will enforce the requirements without waiting for the platform. 3) And of course, clean up ModelManagerImpl. There are plenty of things that could be cleaned up in there to make it more robust without even touching these problems. For example, as you point out, SYNC.acquire should be wrapped in a try-finally. Obviously, we can't do all that in a maintenance release, which is why I provided the simplest, least intrusive part of that change that will also solve the deadlock problem in all cases. > very good goal to have. I *might* be able to be convinced that it would be a > good compromise if we increase the delay to a very long value - say 30-45 > seconds, and then throw the exception to break deadlock. I'd like to hear what Even just throwing a RuntimeException when the wait() aborts on a interrupt exception would be good. We can set up a watchdog timer before we run code that calls into model manager that interrupts the thread after a set wait that we can adjust. My only hesitation on making that the sole mechanism to break the deadlock is that I'm not certain the VM guarantees that that interrupt will work in all cases: for example, if the timer fires the interrupt before wait is called. > Remember, even after this is fixed, you should not hold any rule besides > null/root when calling into APIs (like project/file rules). No plans to > resolve that at the moment. Into which API's? (In reply to comment #15) > > multiple threads are doing the same - you have the additional restriction that > > ALL threads must hold the workspace lock or NONE of the thread must hold the > > lock. Why? > > But it is not imposed and therefore both internal projects and adopters don't > do this. For example, the manual validation job inside WTP still runs with a > project lock. So deadlocks are simply less likely to occur but still > inevitable. > > ** You could say that (currently) calling getModel* while holding any lock > > other than WorkspaceRoot (or null) is NOT supported due to this limitation. ** > > Why isn't this documented on the API? Unfortunately most of that class is poorly documented. A documentation pass would be a very good idea. Also, why not log a warning in > ModelManagerImpl if: > > Job.getJobManager().currentJob().getRule() != WSRoot || != null ? > > It won't help if someone has nested a null rule inside another rule, but it > will help catch almost everything else. I agree. We should impose it - there are technical hurdles to this as well. i.e. we can't really know what current rule for a given Thread is (bug 283446). By imposing it, we'll also potentially break clients much more often than before. Instead of just having a slight possibility of error, you'd assert it upfront and the error will appear in all cases that just happened to work before. In a maintenance stream this could be disastrous, as existing clients written against the current API could break in unforeseen ways. > > Ok - now for our ideas on how to fix this problem: > > Why not do this: > > 1) Don't hold the SharedObject lock while calling outside ModelManagerImpl. > Only use it to achieve synchronous access to SharedObject's fields. Other > threads that try to get the model while it is still loading will still block on > the waitForLoadAttempt, but their wait is on a separate monitor that fails on > an interrupt (so that we can set up a watchdog timer in adopter code) and > ideally times out after some long period, maybe 30s or a few minutes. We could > also make that configurable so that end-users can tune it if they have a > particularly slow machine etc. I believe you are describing the patch you've attached, right? It really doesn't matter if the monitor is a private field or the SharedObject itself. The change in behavior is the timed-wait/interruption support. I'm more inclined to just support a timed-wait than interruption as well. I'd like interruption support to be consistent, and if we only supported interruption in the waitForLoad() method it would depend differently based on which thread happens to be waiting and which thread was performing the load. (The thread performing the actual load attempt cannot be interrupted. In fact, that is one of the other limitations we have in Eclipse. A beginRule() cannot be thread.Interrupted() - but it can be canceled if using a progress monitor). > 2) Split ModelManagerImpl into at least two implementations: one for JSPs and > one for everything else. Resolve the JSP-specific problems in that new model > manager implementation. Since we know that a workspace root is required for > JSP loading (if you can fix that elsewhere so much the better), attempt to > acquire the workspace root before calling outside the model manager impl. This > will throw an exception if called with anything but a null or wsroot rule > (nested or not) and will enforce the requirements without waiting for the > platform. It would be better to know if we really need to acquire a rule to update the classpath. However, that information is not exposed by JDT APIs. To me it seems unfair to *always* to acquire a workspace root rule if will not be used the majority of time. If we knew that JDT was going to acquire a rule, we could prevent deadlock by throwing the RuntimeException at that time -- but at that point -- we could just solve the problem. Instead of throwing the exception, we could just load have the other thread waiting on the SSE model do the resolveClasspath() on our behalf. That's one of the solutions to the problem I mentioned. The amount of work it will take to enforce the restrictions will also fix the problem correctly. > 3) And of course, clean up ModelManagerImpl. There are plenty of things that > could be cleaned up in there to make it more robust without even touching these > problems. For example, as you point out, SYNC.acquire should be wrapped in a > try-finally. Agreed: ModelManagerImpl should be cleaned up. There should be a non-internal very well documented Java Interface instead. > Obviously, we can't do all that in a maintenance release, which is why I > provided the simplest, least intrusive part of that change that will also solve > the deadlock problem in all cases. It would solve the deadlock but it is not a solution in the sense that it has the potential to break existing code using getModel* APIs. However, if it only breaks existing code that would have otherwise been stuck in a deadlock then I'm open to considering it. (Breaking code when faced with a deadlock is the lesser of two evils.) I think using a timed-wait after 30-45s without interruption should meet that criteria. Though, I'm still think we can create a solution for you that works - without changing ModelManagerImpl code. > > very good goal to have. I *might* be able to be convinced that it would be a > > good compromise if we increase the delay to a very long value - say 30-45 > > seconds, and then throw the exception to break deadlock. I'd like to hear what > > Even just throwing a RuntimeException when the wait() aborts on a interrupt > exception would be good. We can set up a watchdog timer before we run code > that calls into model manager that interrupts the thread after a set wait that > we can adjust. My only hesitation on making that the sole mechanism to break > the deadlock is that I'm not certain the VM guarantees that that interrupt will > work in all cases: for example, if the timer fires the interrupt before wait is > called. I'm not sure if I like supporting interrupts in this scenario. In addition to my previous concern with consistency -- it turns out Eclipse automatically performs Thread interrupts for you in some situations. If Eclipse happens to do this interrupt (which can happen very soon after you start executing) you might not ever be able to load a model. (Should Eclipse/Workbench be doing this is another matter open to debate). Details: if you are running on background Thread and then perform a Display.syncExec(), the background Thread will block on a timed-wait until the UI Thread is done executing your Runnable. Every 1000ms, it will interrupt() the UI thread. Not sure if it will be appropriate to clients to abort loading a SSE model in syncExec() blocks only after 1000ms. > > > Remember, even after this is fixed, you should not hold any rule besides > > null/root when calling into APIs (like project/file rules). No plans to > > resolve that at the moment. > > Into which API's? Any getModel* API when using the resource your interested in is a JSP. > By imposing it, we'll also potentially break clients much more often than > before. Instead of just having a slight possibility of error, you'd assert it > upfront and the error will appear in all cases that just happened to work > before. In a maintenance stream this could be disastrous, as existing clients > written against the current API could break in unforeseen ways. I suggest imposing in a major release (i.e. Helios) but start logging it sooner; perhaps 3.1.2 with some advanced warning to adopters. That way, nothing will break, but adopters can get some visibility on it and get 6 months to react. > I believe you are describing the patch you've attached, right? It really Nope. The patch is just step 1. The deeper problem is holding the SharedObject monitor lock while calling outside of ModelManagerImpl > doesn't matter if the monitor is a private field or the SharedObject itself. Actually it does matter. If you continue to lock waitForLoadAttempt/setLoaded on SharedObject's monitor, then the deadlock breaking code will never run since all threads after the first will block trying to enter waitForLoadAttempt instead of on the wait() (the loading thread is holding SharedObject's monitor). > The change in behavior is the timed-wait/interruption support. I'm more > inclined to just support a timed-wait than interruption as well. I'd like > interruption support to be consistent, and if we only supported interruption A fair point on consistency, but only the non-loading waiting threads would be interrupted. We could reset the interrupted flag on the thread if we are worried about callers depending on this being set. I'm not advocating interrupting the loading thread: we only to need to break the deadlock from one side. > Details: if you are running on background Thread and then perform a > Display.syncExec(), the background Thread will block on a timed-wait until the Can you point me to where it does this? In the code I'm looking at, the caller ends up in Synchronizer.syncExec where it does a no-timeout wait until it is awakened by a notifyAll in runAsyncMessages. If interrupted, it calls interrupt() on itself to reset it's interrupted flag, but this in itself does nothing else to the thread that I can see and anyway does nothing to flow of control until lock.done() is true. I also don't see where the framework explicitly calls interrupt with the intention of actually interrupting a blocked state. Created attachment 145676 [details]
Proposed patch (for 3.1 stream)
This patch includes the minimum amount of changes (does not include wrapping locks in try-finally).
It has two properties that it uses to alter behavior. The default behavior is unchanged.
The properties you can set are:
org.eclipse.wst.sse.core.modelmanager.maxWaitDuringConcurrentLoad
(default=0, 0=infinite)
org.eclipse.wst.sse.core.modelmanager.allowInterruptsDuringConcurrentLoad
(default=false)
These can be set via System Property, Environment Variable, or a Configuration Scope or Instance Scope preference. There is no UI to set this preference.
An example of how to set this property at runtime is:
InstanceScope scope = new InstanceScope();
IEclipsePreferences instancePrefs = scope.getNode("org.eclipse.wst.sse.core");
instancePrefs.put("modelmanager.maxWaitDuringConcurrentLoad", "10000");
instancePrefs.put("modelmanager.allowInterruptsDuringConcurrentLoad", "true");
try {
instancePrefs.flush();
}
catch (BackingStoreException e) {
e.printStackTrace();
}
Please test this and let me know if this is adequate.
Created attachment 145681 [details]
Full patch for 3.2 stream
This patch includes everything the previous one did, but also cleans up some of the code.
1) Instead of repeating the load routine 3 times, makes 1 generic method for that
2) try-finally's where appropriate
3) little bits of code cleanup, reducing unneeded local vars, etc.
(In reply to comment #17) > > > Details: if you are running on background Thread and then perform a > > Display.syncExec(), the background Thread will block on a timed-wait until the > > Can you point me to where it does this? In the code I'm looking at, the caller > ends up in Synchronizer.syncExec where it does a no-timeout wait until it is > awakened by a notifyAll in runAsyncMessages. If interrupted, it calls > interrupt() on itself to reset it's interrupted flag, but this in itself does > nothing else to the thread that I can see and anyway does nothing to flow of > control until lock.done() is true. > > I also don't see where the framework explicitly calls interrupt with the > intention of actually interrupting a blocked state. > See org.eclipse.ui.internal.UISynchronizer#syncExec() and UILockListener#interruptUI() ... I didn't have all "4 things necessary to cause a deadlock" at my fingertips during the conf. call, but I found it now. Mutual exclusion Non-preemptible Hold and wait Circular wait All we have to do to solve any deadlock is take one of these conditions away. If your interested in threading I highly recommend reading the presentations from John Arthorne. http://www.eclipsecon.org/2005/presentations/EclipseCon2005_Tutorial7.pdf http://www.eclipsecon.org/2004/EclipseCon_2004_TechnicalTrackPresentations/39_Arthorne-Lemieux.pdf Also, the book "Java Concurrency in Practice" by Brian Goetz, et al. is very informative and contains lots of best practices. I notice you have flipped around the use of monitor lock and additional loadLock variable compared to what I proposed (i.e. the monitor lock is still used for waiters and the loadLock is used protect the structuredModel field). While I'm not sure either approach is better, I think there are additional changes required to the patch as a result: 1) _incrCount and _initCount both still hold the sharedObject lock and call out to the FileBufferModelManager. This needs to be changed to hold the loadLock instead, at least for the portion that calls FileBufferModelManager. 2) _commonCreateModel calls _incrCount at line 287 (post patch) while holding the sharedObject monitor. This additionally needs to be changed to loadLock (in fact it might even be able to be removed since _incrCount already synchronizes on the loading lock. 3) The same is true of the synchronized calls added at 351, 450, 568 and 658 in the patched code. 4) A similar problem now exists with _decrCount. 5) The release protocol now also needs to be changed. Note the private method releaseFromRead(Object) still synchronizes on the sharedObject rather than the load lock to create mutual exclusion and, among other things, calls out to the FileBufferModelManager while disposing of the object while holding the sharedObject lock. 6) Also in releaseFromEdit(Object), the shared object lock is held while calling signalPostLifeCycleListenerRevert, the result of which can be to call out to arbitrary listener code attached to the AbstractStructuredModel. It seems like it would be simpler to use the new lock to manage waiters and leave the sharedObject lock everything else. > 4) A similar problem now exists with _decrCount.
A clarification on this: _decrCount itself doesn't synchronize, but it is always called while holding the sharedObject monitor lock.
(In reply to comment #21) The intention of the patch was to fix the deadlock -- not to completely remove holding locks while calling into FileBufferModelManager. The fix for the deadlock did not require the need to let go of the sharedObject lock while calling into the FileBufferModelManager.connect/disconnect methods. The point of the sharedObject.loadLock is to prevent exclude multiple threads from loading a single file at the same time. Now, this might not even be possible because of the way the master load/test/wait spin-loop is structured. (i.e. if multiple threads enter that method at the same time, one will start loading the model, and the other will wait. I don't see a way that multiple threads can enter the actual load routine... unless something super wacky was happening) -- Regardless, to be safe, I kept the lock there to ensure that it can't happen which I think is an appropriate thing to do in a maintenance-stream patch. If you'll examine the original stack trace, you'll see that [Worker-13] would be hold the SharedObject.loadLock instead of SharedObject. [Worker-7] would be allowed entry into waitForLoadAttempt() where it will enter wait() and give up its sharedObject lock. After [Worker-13] finishes the load of that model, it will acquire the SharedObject lock, set the model, and notify all blocked threads ([Worker-7]). Going further... the fundamental problem is not fixed, so the way this will play out is... [Worker-13] will never be able to finish the load of the model because it wants the Workspace Root scheduling rule - which conflicts with the Project rule of [Worker-7]. So, after several wake/sleep cycles (every 250ms) the timeout value you set is reached. Timeout will stop the wait-loop, and throw an OperationCanceledException which will go up the stack and thrown back to the [Worker-7] Job. The job will then terminate releasing the Project scheduling rule. This will allow [Worker-13] to aquire the Workspace Root scheduling rule and will allow it to finish loading the model. If [Worker-7] runs again, it will likely succeed. (Since, probably - the model was already let go, it'll just load the model itself...) But - even if the [Worker-13] job was still using that model, [Worker-7] will be able to get a hold of that exact same model because it's already loaded. After applying the patch and setting the timeouts, have you been able to reproduce the deadlock? (In reply to comment #23) > (In reply to comment #21) > The intention of the patch was to fix the deadlock -- not to completely > remove holding locks while calling into FileBufferModelManager. The fix for With your patch, consider what will happen if FileBufferModelManager.connect tries to take a rule lock that conflicts with a loaded waiter on the same SharedObject. Same problem as before: the waiter may never enter the waitForLoadAttempt method and the deadlock breaking code will never get executed. The advantage of using the loadLock the other way around is that it doesn't matter what the loading thread does while holding it's sharedObject lock; waiters will only contend with each other for the loadLock so the deadlocking breaking code will always have the opportunity to execute. (In reply to comment #24) The only lock held when calling into FileBufferModelManager.connect is sharedObject.loadLock. It will on occasion take the WorkspaceRoot rule (like Worker-13 shows in your original stack). If this rule conflicts with a rule that a waiting job has, that is ok. The waiter's waitForLoadAttempt is synchronized on sharedObject (not sharedObject.loadLock). The waiter can and does enter waitForLoadAttempt() and will timeout. Only actual loaders hold the sharedObject.loadLock. Waiters and all other exclusive (and with this patch are never-blocking) operations use sharedObject. (In reply to comment #25) > (In reply to comment #24) > The only lock held when calling into FileBufferModelManager.connect is > sharedObject.loadLock. It will on occasion take the WorkspaceRoot rule (like Consider the following scenario using 3.1 patch: Thread 1: 1) getModelForRead(IFile):1441 2) _commonGetModel(IFile, ReadEditType, String, String):505 3) _commonGetModel(IFile, String, IModelHandler...):532 4) _doCommonGetModel(IFile, String, SharedObject, ReadEditType):570 Locks held: SharedObject, SharedObject.loadLock 5) _initCount(SharedObject, ReadEditType):701 Locks held: SharedObject, SharedObject.loadLock 6) FileBufferManager.connect().. Thread 2: Locks held: workspace rule 1) getModelForRead(IFile):1441 2) _commonGetModel(IFile, ReadEditType, String, String):505 3) _commonGetModel(IFile, String, IModelHandler...):522 4) waitForLoadAttempt <blocks awaiting SharedObject taken in 4), never enters, *or* blocks inside the wait because, although there is a timeout, it can never reacquire the SharedObject monitor and get rescheduled. If at 6) connect does anything that takes a rule lock, now or in the future, then we again have deadlock. Ahh - I see now. Stay tuned for a new patch. Created attachment 145821 [details]
Patch to use private condition lock, instead of load lock (3.1)
In light of your latest comments, I think your original suggestion is the best way to go. I've updated my patch to use a private "condition lock" that is only used to receive the setLoaded() notification.
Created attachment 145825 [details]
Patch to use private condition lock, instead of load lock (3.1)
Oops. notifyAll() on wrong object.
Created attachment 145829 [details]
Patch to use private condition lock, instead of load lock (3.1)
Double oops. wait() on wrong object.
(In reply to comment #30) > Created an attachment (id=145829) [details] > Patch to use private condition lock, instead of load lock (3.1) > Double oops. wait() on wrong object. This looks good. I have patched and will retest this afternoon. Thanks. (In reply to comment #31) > This looks good. I have patched and will retest this afternoon. Thanks. Do you have any results running with the patch yet? Does it meet your needs? (In reply to comment #32) > (In reply to comment #31) > > This looks good. I have patched and will retest this afternoon. Thanks. > > Do you have any results running with the patch yet? Does it meet your needs? Hi Min. Yeah, the patch looks good. I have final task to integrate it into our product, but I have tested it sufficiently that I think it's safe to move toward committing it. Thanks. We are waiting for this patch to be applied to WTP 3.1.1 release to integrate and test with the adopter product. Created attachment 146449 [details]
Patch to use private condition lock, instead of load lock (3.1)
Minor tweak, changing the importance-order of the value-lookup for timeout and interruption.
New order:
default-default < instanceScope < configurationScope < systemProperty < envVar
*Explain why you believe this is a stop-ship defect. Or, if it is a "hotbug" (requested by an adopter) please document it as such. Requested by adopter *Is there a work-around? If so, why do you believe the work-around is insufficient? Not without making significantly larger changes elsewhere in the adopter code, and even then it wouldn't be 100% solved. *How has the fix been tested? Is there a test case attached to the bugzilla record? Has a JUnit Test been added? No test cases for reproducing, but *very* detailed explanation of scenarios in the comment history. *Give a brief technical overview. Who has reviewed this fix? Provides mechanisms for letting the ModelManager timeout when attempting to load/create an IStructuredModel, which would hopefully only happen during lock starvation. Reviewed by Min Idzelis, Cameron Bateman, and myself. *What is the risk associated with this fix? Low, as we suspect the fallout from timing out will be preferable to a deadlock. Even then, SSE will not enable this timeout by default (keeping it consistent with 3.1.0) and will rely on adopter code to do so on their behalf. I approve for 3.1.1. Released to 3.1.1 I don't know why I didn't notice this before, perhaps I was confounding the different ways to set the time out value, but in my final sanity test on this I discovered that the preference option is broken. That attached patch shows a single line change that adds a return keyword to the routine that gets the preference manager value. We consider this issue blocking on 3.1.1. Created attachment 147495 [details]
Add a return statement without which the pref value never gets used.
I suppose a "value=" in place of the return would be cleaner. Created attachment 147502 [details] patch based on comment 41 Good catch. I'm attaching a patch based on comment 41. I'm releasing the exact attached patch for a 3.1.1 build preemptively, but re-requesting a PMC review just in case. Retagging R3_1_1 as well. Please verify http://build.eclipse.org/webtools/committers/wtp-R3.1-M/20090917225226/M-3.1.1M-20090917225226/ quickly :) Created attachment 166062 [details]
Patch ibackported to 2.x stream based on 3.1 stream
Created attachment 166063 [details]
Plugin for testing with the patch for stream 2.x
|