When, inside a Job, you issue beginRule(), you try to acquire that rule. If that rule is currently busy, the Job will spin waiting for it. If the thread that currently owns the scheduling rule then issues IJobManager#transferRule to the Job's thread, the Job does NOT detect that it now owns the rule, causing it to keep spinning. This scenario should be supported. See related bug 283446.
Very important update. I did not phrase this exactly how I meant it. I did not mean that a *thread* cannot transfer a rule to a Job that is trying to acquire it. I meant that if a *different* Job wants to transfer its rule to another Job, it will not work. I've updated the description. Flow:

Job 1 (no rule) acquires A (spins)
Job 2 (rule=A) transfers A to Job 1
Job 1 will not realize it has rule A and will keep spinning.

Note: This violates some of the documented API. transferRule: "The identical scheduling rule must currently be owned by the calling thread as a result of a previous call to beginRule." In the case of a Job running with a rule, it never explicitly called beginRule.
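To make this concrete, here is a compilable sketch of the scenario (the rule implementation is hypothetical, and the timing is illustrative - job1.getThread() is only non-null once job1 has actually started spinning):

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.ISchedulingRule;
import org.eclipse.core.runtime.jobs.Job;

public class TransferBetweenJobs {
	// Hypothetical rule that conflicts only with itself.
	static final ISchedulingRule RULE_A = new ISchedulingRule() {
		public boolean contains(ISchedulingRule rule) {
			return rule == this;
		}
		public boolean isConflicting(ISchedulingRule rule) {
			return rule == this;
		}
	};

	public static void main(String[] args) {
		final Job job1 = new Job("Job 1") { // no rule set on this Job
			protected IStatus run(IProgressMonitor monitor) {
				// Spins here while RULE_A is held by job2...
				Job.getJobManager().beginRule(RULE_A, monitor);
				try {
					return Status.OK_STATUS;
				} finally {
					Job.getJobManager().endRule(RULE_A);
				}
			}
		};
		Job job2 = new Job("Job 2") {
			protected IStatus run(IProgressMonitor monitor) {
				// ...and job1's spin loop never notices this transfer.
				Job.getJobManager().transferRule(RULE_A, job1.getThread());
				return Status.OK_STATUS;
			}
		};
		job2.setRule(RULE_A); // job2 implicitly owns RULE_A while running
		job2.schedule();
		job1.schedule();
	}
}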
Just pinging this to request that a fix be committed for Eclipse 3.6. I also want to call attention to bug 287096, which depends on this bug. Several adopters (Oracle/IBM) are hitting issues that can be fixed if this bug is fixed.
Created attachment 146205 [details]
Test case

Two tests illustrating transfer between jobs: one where the rule was acquired via beginRule (passes), and one where the rule was acquired automatically when the job started (currently fails).
These are mainly notes for myself... I spent the afternoon on this but haven't found a working solution so far. There are many cases to handle:

Source thread:
- Job with rule
- Job with rule that has done nested beginRules (have to transfer the rule stack)
- Job with no rule that did beginRule (this currently works)
- Simple thread that did beginRule (currently works)
- Simple thread with no rule

Destination thread:
- Job with rule (should fail)
- Job with no rule that did beginRule (should fail)
- Job with no rule
- Simple thread that did beginRule (should fail)
- Simple thread with no rule (currently works)

Several of these cases involve a mixture of "explicit" job rules (Job.setRule()) and "implicit" job rules (IJobManager.beginRule). Explicit job rules are currently handled in JobManager, and implicit rules are handled in the ImplicitJobs class. Each of these classes has separate data structures and locks, which makes it difficult to handle this without risking deadlocks.
Also to clarify behaviour, this is the flow I currently see happening:

Job 1 (no rule) acquires A (spins)
Job 2 (rule=A) transfers A to Job 1 -> This transfer is illegal, so it throws an exception, causing Job 2 to end and release the rule, which is then acquired by Job 1.

So I don't actually see an infinite wait here, but I can see it's not the desired behaviour.

Another wrinkle to consider is whether the transfer should persist after the jobs in question have finished. I.e., this is the behaviour you would see:

job1.getRule() -> "SomeRule"
job2.getRule() -> null
jobManager.transferRule(SomeRule, job2.getThread());
job1.getRule() -> null
job2.getRule() -> "SomeRule"

It seems natural that the transfer would persist after job completion rather than magically transferring back when the job completes. However, this could be very surprising to the job implementor if they weren't the party that performed the transfer. Suddenly their job would no longer have any rule for subsequent runs.
Created attachment 146233 [details]
Partial fix

This is a work in progress that only handles some of the cases. This is not a complete fix.
This might be a half-baked idea, but I think it will also solve the problem. The point you bring up in comment 5 is a good one: what should happen to the rule if the Job that it is transferred to ends? I think we can side-step that issue by going about this another way.

What if we introduce the concept of "yielding" a Job? It should only be allowed to be invoked from within the run() method of the Job (so only the calling Job would have this authority). yield() would basically put the Job back in the "waiting" state. Since the Job would be waiting again, it would temporarily release its explicit (and implicit) rules. These rules would then be available for any other thread to begin, or even for other Jobs with the same rule to start. When those Jobs/threads finish, they would release the rules, and the yielded Job could resume execution at the point after the yield() call.

The concept of yield would mean that you don't have to fiddle around with transferring rules to another thread and then transferring them back or ending them. It also neatly avoids the thorny issue of a Job having its rule disappear without it knowing. How does that sound?

Btw, if a Job that performed a beginRule() finishes its execution, the rule is automatically ended. A log message is output:

!ENTRY org.eclipse.core.jobs 4 1 2009-09-02 00:17:00.612
!MESSAGE Worker thread ended job: (371), but still holds rule: ThreadJob((371),[R/,])
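Going back to the proposal, a sketch of how a client might use it (hypothetical: yield() does not exist yet, and doChunkOfWork()/someRule are placeholders, not real API):

Job longRunner = new Job("Long operation") {
	protected IStatus run(IProgressMonitor monitor) {
		for (int i = 0; i < 100; i++) {
			if (monitor.isCanceled())
				return Status.CANCEL_STATUS;
			doChunkOfWork(i); // placeholder for a unit of real work
			// Temporarily release our rules so any conflicting
			// Job/thread can run, then resume where we left off.
			yield(); // proposed API, callable only from within run()
		}
		return Status.OK_STATUS;
	}
};
longRunner.setRule(someRule); // someRule: an assumed ISchedulingRule
longRunner.schedule();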
Created attachment 150413 [details] Implementation of yield() on Job
Created attachment 150414 [details] Test code
Added my initial stab at implementing Job.yield(). I have also added tests for all sorts of mixtures between yielding jobs and threads, with combinations of implicit and explicit rules active. Note: this implementation does not release ILocks or Java-level locks, the latter of which would be impossible. I'm eagerly waiting for any and all feedback. Writing this patch to introduce a new capability to the JobManager was interesting - dare I say... a bit fun. ;-) Please let me know if you have any questions on any decisions I made or on some little re-arrangements I needed to make with regard to lock ordering inside the JobManager.

As I was looking around I found some other areas that might need a little improvement. One thing I don't really like is the timed wait() in ThreadJob#joinRun(). There may be a way to do it without the timed wait. Some of the documentation could use a little improvement. Would you be interested in a separate patch for some doc improvements?
(In reply to comment #10)
> Added my initial stab at implementing Job.yield(). I have also added tests for
> all sorts of mixtures between yielding jobs and threads, with combinations of
> implicit and explicit rules active.

Thanks Min, I probably won't get a chance to look at it in detail until next week, but from my quick look it looks promising.

> As I was looking around I found some other areas that might need a little
> improvement. One thing I don't really like is the timed wait() in
> ThreadJob#joinRun(). There may be a way to do it without the timed wait.

I don't like it either. It wasn't like this originally, but due to some crazy cases involving recursive reentrant waits it was the only solution I could find that didn't deadlock or respond to notifications at the wrong time (and then livelock due to lack of a notification). Bug 41925 is where I originally switched to a timed wait, and bug 50949 is another similar reentrant case.

> Some of the documentation could use a little improvement. Would you be
> interested in a separate patch for some doc improvements?

Sure.
Created attachment 150662 [details]
Implementation of yield()

This is my latest patch. It fixes some race conditions that happened when yield() was racing with beginRule() on a different thread. I fixed this by combining the implicitJobs notify monitor with the global job manager lock. This makes a lot of sense to me, since you don't have to worry about dealing with two locks at once. It also reduces the need to release/acquire the global locks so often, which also reduces lock contention. The other fix is that getThread() was not restored after yield(). A regression test was added for this.

Note: this patch includes changes for the following bugs:

bug 293308 - restore interrupts in ILock#acquire() -- straightforward.

bug 258986 - spin loops (removed timed-wait) -- All UI and non-UI tests pass. To revert, just add a 250ms timeout. Also note that I cannot automatically throw an OperationCanceledException when the wait() is interrupted, as I originally wanted. This is because the joinRun() could be executing in the UI thread. If a syncExec comes along, the UISynchronizer will interrupt the UI thread every 1000ms. We don't throw the OperationCanceledException until cancellation is requested via the ProgressMonitor. Interrupts are properly restored. This is actually good - because we are no longer spinning every 250 millis, we're also not spinning the EventLoopProgressMonitor. The UISynchronizer ensures that events like syncExecs() are still processed. Last note: a monitoring thread is installed that checks the isCanceled() status of the progress monitor. If it happens to be the EventLoopProgressMonitor, exceptions are thrown because the ELPM's runEventLoop() method doesn't check whether the invoking thread is the UI thread. Included in this patch is a small check to prevent this.

bug 265324 - Deadlock with ILocks possible due to improper cleanup of ILocks not released when Job ends -- Just added a warning log; do not remove locks (yet).
Created attachment 150664 [details] Test code for yield()
(In reply to comment #12)
> This is my latest patch. It fixes some race conditions that happened when
> yield() was racing with beginRule() on a different thread. I fixed this by
> combining the implicitJobs notify monitor with the global job manager lock.
> This makes a lot of sense to me, since you don't have to worry about dealing
> with two locks at once. It also reduces the need to release/acquire the global
> locks so often, which also reduces lock contention.

The global lock is released/acquired very often because we can't hold it while calling third-party code, due to deadlock risk. If we hold this job manager lock while calling a third party, then any locking they might do can quickly lead to deadlock. For example, in your patch you now call LockListener while holding the job manager lock, which will likely cause deadlocks (UILockListener processes sync execs, which can run arbitrary code). There may be other places in your patch where you introduce this problem, and overall combining these unrelated locks doesn't seem like a good idea - using a single lock for unrelated things often leads to trouble (the job manager lock is for protecting JobManager class data structures, and the ImplicitJobs lock is used for wait/notify purposes).

> Note: this patch includes changes for the following bugs:

Please resist the temptation to lump unrelated fixes into the patch. It just makes the patch harder to review and accept. Other contributions are most welcome but much easier to track separately.
(In reply to comment #14)
> The global lock is released/acquired very often because we can't hold it while
> calling third-party code, due to deadlock risk. If we hold this job manager
> lock while calling a third party, then any locking they might do can quickly
> lead to deadlock. For example, in your patch you now call LockListener while
> holding the job manager lock, which will likely cause deadlocks
> (UILockListener processes sync execs, which can run arbitrary code). There may
> be other places in your patch where you introduce this problem, and overall
> combining these unrelated locks doesn't seem like a good idea - using a single
> lock for unrelated things often leads to trouble (the job manager lock is for
> protecting JobManager class data structures, and the ImplicitJobs lock is used
> for wait/notify purposes).

Good catch. It was not my intention to hold the lock while calling foreign code. Point taken, and my new patch fixes this. I could only find one additional place where the lock leaked out, which was when calling into the IProgressMonitor. (Also fixed.)

> > Note: this patch includes changes for the following bugs:
>
> Please resist the temptation to lump unrelated fixes into the patch. It just
> makes the patch harder to review and accept. Other contributions are most
> welcome but much easier to track separately.

Ok. The latest patch removes the other unrelated fixes. It's not always possible to create independent patches because the patches touch the same files. I'm ok with doing it lock-step, but I also wanted to get some exposure for the other parts of the patch.
Created attachment 150771 [details]
Latest implementation of yield()

Just uploading the patch now. Will provide a detailed explanation of the changes in a follow-up comment.
Oh, I forgot to mention that all UI and non-UI tests pass, as well as a quick ad-hoc test of various Eclipse activities. Please assume that all my patch revisions from now on are tested thoroughly before I upload them.
Explanation of the patch. Yield is implemented using a new internal state YIELDING, which for API purposes appears like WAITING.

yield() returns a boolean indicating whether it actually ran another job, because of internal race conditions that did not seem necessary to avoid. The race condition I'm talking about is that scheduling two conflicting jobs back to back may cause yield() to miss the 2nd job. That is because during schedule a NONE job must shift into WAITING, and a worker thread must be woken up before WAITING can turn into BLOCKED. When yield() is invoked and 2 conflicting jobs were scheduled back to back, the 2nd Job may not have transitioned to BLOCKED yet, so yield() immediately returns. Instead of coming up with a way to synchronize all this, I simply return a boolean. The test code spins until yield returns true. Alternatively, I could have provided a "wait until blocked" facility, but that seemed to be of very limited use and would expose an internal state.

yield() needs to be able to yield to both blocked explicit jobs and implicit jobs. Explicit jobs were easy, because the jobs that are blocked on a running job are linked to the running job in a queue. Implicit jobs required a bit more work. Waiting and running explicit jobs were tracked by the JobManager; only running implicit jobs were tracked. I added a new map of waiting implicit jobs. This map is updated during joinRun(). Once I had a list of waiting implicit jobs, I was able to iterate the job list looking for a job that conflicted with the yielding job, and this job is the one I will yield to. In either case, the conflicting explicit or implicit job is "picked" as the job yield() will wait on before returning.

After we have picked a job, I change the state of the current job to YIELDING. The new state was required because the WAITING state puts things into the waiting queue, which is then serviced by worker threads. YIELDING does not do anything, but it does run the appropriate transition actions (unblocking jobs).

Now we must wait until that job is done. Previously, the global job manager lock was only used as a mutex and was not used for any conditional waits. I have added signaling to the manager lock. That means I can wait until a certain condition is met. The condition I'm waiting for is for my "picked" job to transition into any state except WAITING. Any time a state transition is performed, a signal is issued on the global lock, which wakes up any thread waiting on it. So the "wait until not waiting" loop is performed, which gives up the global lock, which allows worker threads to get the next waiting job (the "picked" job) to run. Since that job's state becomes ABOUT_TO_RUN, the transition wakes up the yielding thread, which moves on to the next phase: "wait until I am able to run again". Since it is not able to run right away, it goes into wait() immediately. After the picked job is done, another state transition is performed, which wakes up the yielding job. This yielding job is now able to run, so a state transition is performed back into the RUNNING state and its active Thread is restored. (A sketch of this two-phase wait follows below.)

A note about implicit jobs. They were a little tricky. I needed a sure-fire way to hand off the yield() job to an implicit job. Previously, ThreadJob did have an internal lock that was used strictly for notification purposes. This was just a "helper" to abort early (under some circumstances) the 250ms timed-wait that was present in the ThreadJob.joinRun method.
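In rough pseudo-Java, the two-phase wait amounts to this (a paraphrase of the internals described above, not the actual patch; 'lock', 'changeState', 'findBlockingJob' and the state constants are stand-ins):

// Simplified paraphrase; not the actual patch code.
void yieldTo(InternalJob yieldingJob, InternalJob picked) throws InterruptedException {
	synchronized (lock) {
		changeState(yieldingJob, YIELDING); // reported as WAITING via the API
		// Phase 1: wait until the picked conflicting job leaves WAITING,
		// i.e. a worker thread has claimed it. Every state transition
		// issues a signal on 'lock'.
		while (picked.getState() == WAITING)
			lock.wait();
		// Phase 2: wait until the yielding job may run again, i.e. no
		// conflicting job holds the rule anymore.
		while (findBlockingJob(yieldingJob) != null)
			lock.wait();
		changeState(yieldingJob, RUNNING); // restore active thread, resume
	}
}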
That 250ms timed wait in joinRun is not ideal because it presents a lot of contention issues. The joinRun actually needed to acquire the global lock 2 times every loop iteration. Since it was waking up every 250ms, it was acquiring the global lock very often. This could significantly slow down other simultaneous, unrelated jobs. We could reduce the number of times the global lock is acquired by surrounding both methods that acquire the lock in a single synchronized block. But then the notification lock would not release the global lock. Aha - but we don't need this notification lock anymore. Why not just use the same global-lock notification mechanism that we are using for similar reasons in the yield() method and state transitions? For a joinRun to successfully return, a state transition needs to be performed to allow the joinRun to proceed. The state transition already notifies the global lock, so this is a perfect fit.

But new wrinkles popped up. Things are never that simple, right? It turns out that this timed wait had 2 additional purposes besides just waiting until the job is able to run: 1) it checked cancellation (and via a creative hack, this sometimes ran the display's event loop if it happened to be on the UI thread), and 2) it periodically invoked a lock listener via the lock manager.

The solution for #2: The lock listener is foreign code and can invoke anything. We must not hold the global lock when calling the lock listener. A compromise is made. For every new blocking job that we find, we call the lockManager's aboutToWait() just once. It should be enough to invoke it once if the method is consistent (returns the same output for a given input, i.e. is not random). Remember, any state transition (even for unrelated jobs) can wake up this joinRun method. So, as long as we realize that "nothing happened" - i.e. we're still blocked by the same job as before - we don't invoke the lock listener. This prevents us from having to acquire/release/acquire/release the global lock every time we wake up. Now we will only acquire/release the lock once.

The solution for #1: If we are in a wait(), we can't be polling the IProgressMonitor. That means we can't be canceled while waiting - which is unacceptable. I added a new internal thread to the JobManager. This thread runs every 250ms (but could be even faster now, since it does less work) and invokes isCanceled() on the monitor. If the monitor is canceled, it interrupt()s the thread that is wait()ing. This causes the loop to wake up and notice that the monitor is indeed canceled, and so the OperationCanceledException is thrown. Ah, another wrinkle: what about that creative hack that runs the event loop when you call into the progress monitor? I'm no longer invoking the monitor on the same thread that created it, so the event loop can't be run. How to fix this? Well, the BEST way I discovered was to add a new method to the lock manager/listener: shouldFork(). The lock listener can decide if it's ok to "block" the calling thread. The default is false. But if the UILockListener is present, it should return true when the calling thread is the UI thread. This means that I will create a thread that will perform the wait until the thread job can successfully return. Then I will block until that thread job is finished. Since I'm running in the UI thread, I check for cancellation of the monitor and propagate that to the thread job.
I also only wait for a max of 250ms, so that I can call isCanceled() on the progress monitor within the UI thread, which allows it to spin. One LAST wrinkle: this loop in the UI thread must notice when rules are transferred to it. Because, for all intents and purposes, other clients assume that the beginRule is happening "in the context of" the thread that invoked it (even though it really isn't). So, if a rule is transferred to the UI thread, the loop will notice it. It will attempt to shut down the thread it spawned, and if that is successful it will return the ThreadJob that was unblocked. Whew! (A sketch of this forked wait follows below.)

I ran all the UI tests. I noticed that some tests are still unresolved, namely the test for bug 105491 and a few others in the same vein. All the issues that were not solved are still not solved. ;-) So, this should be a bug-compatible change. Speaking of which, it seems that issues like bug 105491 and its kin could be solved by implementing ModalContext using ILocks or Jobs that do the transfer of rules automatically. I realize this is probably not done because ModalContext is part of JFace, which doesn't have a dependency on jobs. However, maybe that could be worked around similarly to what JobManager does with UISynchronizer and UILockListener. That is, could ModalContext provide an execution hook? When JobManager starts up, it could register that hook with JFace. (This is a half-baked idea for now...)

Note: While writing this I noticed a possible undesired side effect. When yielding to an implicit job, the job that I "pick" to wait for may not be the next implicit job that runs. That is because a random conflicting job is chosen to wake up when the yield notifies. Eventually the "picked" job will run, but other jobs may have run before it. This is not true for explicit jobs, because the blocked jobs have a specific order: after yield() notifies an explicit job, it is guaranteed that the picked one will run next.
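Putting the forked wait together as a sketch (waitUntilRuleIsFree() is an assumed stand-in for the real blocking wait, and the listener method is the proposed shouldFork(); this is a paraphrase, not the patch code):

// Sketch: a thread that must not block (e.g. the UI thread) forks the
// wait to a helper thread and polls for cancellation itself.
void waitPossiblyForked(final IProgressMonitor monitor) throws InterruptedException {
	if (lockListener.shouldFork()) {
		Thread waiter = new Thread("Wait for run") {
			public void run() {
				waitUntilRuleIsFree(); // assumed helper: the blocking wait
			}
		};
		waiter.start();
		while (waiter.isAlive()) {
			// Polling isCanceled() on the UI thread lets the (hacked)
			// progress monitor spin the event loop and service syncExecs.
			if (monitor.isCanceled())
				throw new OperationCanceledException();
			waiter.join(250); // short timed join, then poll again
		}
	} else {
		waitUntilRuleIsFree(); // ordinary threads may block directly
	}
}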
I'm having trouble seeing how the switch in ThreadJob.joinRun from a timed to an untimed wait is related to the yield implementation. It seems that, to avoid this 250ms wait, you now wake up all waiting thread jobs on every job state transition. There can be a large number of job state transitions in a second, so I think this could easily produce even more contention than the old code. You'd be amazed how often job state transitions occur (try putting a println in JobManager#changeState and run the workbench). You also spin up a new thread each time a wait occurs in the UI thread, which is heavyweight. Couldn't WaitForRunThread be a job? I'm still working my way through the patch and need to spend more time with it. Overall it seems quite complicated, so I'm trying to sort out which changes might be unrelated or unnecessary so I can focus on the core new functionality.
(In reply to comment #19)
> I'm having trouble seeing how the switch in ThreadJob.joinRun from a timed to
> an untimed wait is related to the yield implementation. It seems that, to
> avoid this 250ms wait, you now wake up all waiting thread jobs on every job
> state transition. There can be a large number of job state transitions in a
> second, so I think this could easily produce even more contention than the old
> code. You'd be amazed how often job state transitions occur (try putting a
> println in JobManager#changeState and run the workbench).

Fair points. It should be possible to just use notify() instead of notifyAll(). The change then would be that one waiter would wake up, and if it had no work to do, it would need to propagate the notify() before it fell back asleep. This needs to work with interruption, because the thread that is notified might have been interrupted. That means that when interrupts are handled, notify() also needs to be performed. Would you like me to go ahead and make that change? Or we could wait and try to measure whether this really is an issue first.

> You also spin up a new thread each time a wait occurs in the UI thread, which
> is heavyweight. Couldn't WaitForRunThread be a job?

Hmm. Perhaps, but it would be a little heavy. Scheduling a job requires multiple lock acquisitions and also notifies listeners. Nobody should ever know about this very special "job." Ideally, we'd just get a thread from the WorkerPool and put it back when we're done. That will probably require changes in the WorkerPool so that it cooperates with jobs.

> I'm still working my way through the patch and need to spend more time with
> it. Overall it seems quite complicated, so I'm trying to sort out which
> changes might be unrelated or unnecessary so I can focus on the core new
> functionality.

I really think removing the timed wait is necessary to ensure a transition from the yield() to a waiting explicit thread.
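For reference, the notify() hand-off I have in mind is this generic pattern (a sketch; 'monitor' and haveWorkToDo() are placeholders):

// A waiter that wakes with nothing to do passes the notify() along
// before waiting again, so a single notify() is never lost.
synchronized (monitor) {
	while (!haveWorkToDo()) {
		try {
			monitor.wait();
			if (!haveWorkToDo())
				monitor.notify(); // the wake-up was not for us: pass it on
		} catch (InterruptedException e) {
			monitor.notify(); // propagate the signal even when interrupted
			Thread.currentThread().interrupt();
			return;
		}
	}
	// ...do the guarded work while still holding the monitor...
}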
(In reply to comment #20)
> I really think removing the timed wait is necessary to ensure a transition
> from the yield() to a waiting explicit thread.

I take that back; it is not absolutely necessary if a timed-wait is used. I think you can safely remove that part from the patch. If you want, we can deal with removing the timed-wait as a separate bug/patch. Would you like me to rework the patch and add a new bug to track the timed-wait? I still think it would be a very good thing to do. You might get a lot of state transitions back to back, but there may be large periods of time with no state transition at all. During that time, the beginRule() would be contending for the lock repeatedly - especially if a large number of threads were performing beginRule on the same object.
Created attachment 151612 [details]
Cosmetic changes

Changes from previous:
- Changed EE to Foundation 1.1/J2SE 1.4 to allow using the RuntimeException(String, Throwable) constructor that's not available in Foundation 1.0/J2SE 1.3
- Incremented bundle version to 3.5
- Changed @since on new methods to 3.5
I'm still bothered by this notifyAll on every state change. It seems you should only need to notify when changing *out* of the RUNNING state; at this point there are waiters that might need to be woken up. I think the only reason you currently need to notify for all state changes is the middle part of the yield (wait until the job I picked is not waiting). I'm wondering if that middle part can be replaced by a simple join(pickedJob), i.e., block until the picked job is done rather than until it starts. That would make the yield() implementation a bit simpler and allow us to only notify the lock when exiting the RUNNING state. In fact, we already have the notifyWaitingThreadJobs() method that is called whenever somebody who is waiting should be woken up (i.e., a job is done or someone is releasing a rule). That should be sufficient to wake up the waiting yield without having a notifyAll() in the changeState method at all.
Created attachment 151867 [details] v4 (new technique, wake waiting threads precisely)
(In reply to comment #23)
> I'm still bothered by this notifyAll on every state change. It seems you
> should only need to notify when changing *out* of the RUNNING state. At this
> point there are waiters that might need to be woken up.

I've addressed your concerns by adding the field jobStateChange in InternalJob. No longer will the global lock be notified on every state change. This lock will be held and notified while the state changes for a specific job. Only the yielding/waiting thread jobs blocked by a job will be woken. This will reduce contention when unrelated jobs perform state transitions. Note, it must be *any* state transition, not just from RUNNING: a waiting job can be canceled and never run. That is why yield waits until the unblocked job is not waiting anymore.

> I think the only reason you need to notify for all state changes currently is
> because of the middle part of the yield (wait until the job I picked is not
> waiting). I'm wondering if that middle part can be replaced by a simple
> join(pickedJob). I.e., block until the picked job is done rather than until it
> starts. That would make the yield() implementation a bit simpler, and allow us
> to only notify the lock when exiting the RUNNING state.

Yes, join is basically what we're doing there. However, the existing join code cannot be used because it checks to see if the job is running; in our case, the job may still be waiting. The first step in yield is to wait for the job to exit WAITING, then try to start the job again. As an aside, the existing join implementation could be enhanced to use the same technique: instead of adding a listener, you could simply wait() on the jobStateChange. This may completely eliminate the need to sync on the global manager lock. In fact, by having a per-job jobStateChange lock, the global lock can be removed in several other places. Before I start to ramble: I believe we can get rid of the global lock completely, along with the thread contention that happens when many threads are scheduling jobs at once. Right now there is one waiting queue, and all the (possibly hundreds of) workers need to synchronize on it. We could create queues per worker thread. A schedule would then just need to synchronize on a worker thread. This means that potentially [workerthreadpoolsize] schedule ops could be performed simultaneously. Additionally, if a job is rescheduled, it would just be re-added to the worker's run queue. But, I digress.

> In fact we already have the notifyWaitingThreadJobs() method that is called
> whenever somebody who is waiting should be woken up (i.e., a job is done or
> someone is releasing a rule). That should be sufficient to wake up the waiting
> yield without having a notifyAll() at all in the changeState method.

I added a parameter to notifyWaitingThreadJobs that takes a specific job to notify, which is more precise. I also clarified the lock acquire protocol, listing the order in which locks should be acquired. I changed the name of the WaitForRunThread to indicate that it is a worker, and what the source thread is waiting for. Also fixed a small bug with cancellation: when a job was canceled via the blocked jobs dialog, the exception was not trickled out of the joinRun. This change was in WaitForRunThread#createException; and in ThreadJob#waitForRun, after the wait times out, we need to check whether the WaitForRunThread had finished first. This last patch incorporates the changes in your latest patch.
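For clarity, the per-job notifier boils down to this pattern (a self-contained sketch; names and states are illustrative, not the patch code):

// Each job carries its own monitor, so a state change wakes only the
// threads waiting on *that* job rather than every waiter in the system.
class InternalJobSketch {
	static final int WAITING = 1, RUNNING = 2; // illustrative states
	final Object jobStateLock = new Object();  // per-job notifier
	private int state = WAITING;               // guarded by jobStateLock

	void changeState(int newState) {
		synchronized (jobStateLock) {
			state = newState;
			jobStateLock.notifyAll(); // wake only this job's waiters
		}
	}

	void awaitLeaving(int oldState) throws InterruptedException {
		synchronized (jobStateLock) {
			while (state == oldState)
				jobStateLock.wait(); // any transition of this job wakes us
		}
	}
}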
This is getting very close. A couple of comments:

1) ThreadJob#waitForRun still grabs the JobManager lock, but I'm not sure it's still needed. The comment says it takes the lock because it is used for notification, but this is no longer the case. It would be great to be able to make JobManager.lock private again and not refer to it here. There is a small window between calling manager.runNow(this) and manager.findBlockingJob(this) where the blocking job could end, and findBlockingJob(this) would then return null. If we handle that case then I think we can get rid of the sync on JobManager.lock here:

	blockingJob = manager.findBlockingJob(this);
	if (blockingJob == null)
		continue;

2) ThreadJob#joinRun has a for loop inside "synchronized (runner.getNotifier())". This loop calls foreign code such as monitor.isCanceled and monitor.setCanceled. I'm not sure there is a deadlock risk here, but large synchronized blocks with foreign calls like this worry me, especially since I think it's not needed here.

I will attach a new patch that is somewhat simplified, with a smaller locking window.
Created attachment 152217 [details]
Fix v5

Change to ThreadJob#joinRun described in the previous comment, and various trivial doc and style changes.
Two other small comments:

The implementation of JobManager.yield starts by calling Thread.yield(). I don't see the point of this, as the job hasn't yet relinquished any lock at this point. I don't think Thread.yield() needs to be called at all in this method - there isn't really any relationship between the contract of Thread.yield and Job.yield. In fact, I'm tempted to call your new method something like yieldRule() to help avoid any confusion.

The JobManager class comment says the lock order is: JobManager.lock -> JobManager.implicitJobs. However, JobManager.yield() does this:

	synchronized (implicitJobs) {
		synchronized (lock) {

I don't see why the outer sync block is needed here - just holding the JobManager.lock should be sufficient, I think.
Thanks for making the joinRun code more readable - much cleaner! I think we're very close too. I'm ok with all your latest changes. I'm going to attach my latest version, which does the following:

1) Made JobManager.lock private again. I'm ok with this as long as we change JobManager.runNow to return the blocked job it found (instead of a boolean). The reason I synced on JobManager.lock in ThreadJob was to reduce the number of lock acquires - to reduce the possibility for contention. This is also more efficient, since we don't have to invoke findBlockingJob twice. By having this method return the blocking job it found, we don't need JobManager.lock to be visible.

2) Nice catch on calling isCanceled() within a lock inside of joinRun! I'm totally in agreement with this change.

3) Clarified the javadoc for the lock acquisition rules. The javadoc was wrong. You must acquire locks in this order: WorkerPool -> JobManager.implicitJobs -> JobManager.lock -> InternalJob.stateChangeLock (sketched after this comment). This is because at various points ImplicitJobs will call into JobManager's lock (i.e. implicitJob.begin() will call into manager.currentJob(), which needs the lock). Since yield calls into implicitJobs and must hold the global lock, the implicitJobs lock must be acquired first. If you like, you can remove the implicitJobs sync block from yield() and run testYieldJobToThread, and it will show the deadlock.

4) This change you may find somewhat questionable. I made ThreadJob#waitForRun NOT try to look for rules transferred to the running thread if it is running via a private thread. Nobody outside the Job framework "knows" about this private thread. Since the JobManager & friends never transfer rules to this private thread, this check is not necessary. It might improve readability. The performance cost of this check is probably negligible. So I'm fine either way - include this change or not.

5) Why did I include Thread.yield()? Well, I imagine most real clients of the JobManager.yield() method will have other means of ensuring that the yield() will ultimately run the specific blocked job the client is interested in. This is because yield() does not guarantee that a specific previously blocked job (in the case of multiple) runs. For some clients that do not care which other job runs - and that may be a rare scenario - it might make sense. That's because in scenarios like the testYieldJobToJob test, the yield is part of a spin loop. It's probably better to spin with a Thread.yield() than a real spin loop, though neither is ideal. It also assists - a bit. Scheduling two jobs back to back will put them both into the WAITING state. The first job may start running and still find the other job in the WAITING state. It actually takes a worker thread to change the state from WAITING to BLOCKED. Since this can't happen while yield() holds the JobManager.lock, Thread.yield() is issued to help the worker get hold of the lock so it can set the state to BLOCKED. Arguably, the calling code should perform the Thread.yield() rather than yield() itself. I'm ok with removing the Thread.yield().

Final note: one thing I'm still not super-happy about is that the spinning thread in joinRun calls manager.findBlockingJob every 200ms. That means lock acquires - possibly leading to lock contention. However, this should be pretty rare, and I'm going to let sleeping dogs lie on this one.
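To spell out the lock order from point 3 as code (a sketch; the monitor names mirror the discussion, and taking only a subset is fine as long as the relative order is preserved):

	// Never acquire an earlier lock while holding a later one.
	synchronized (workerPool) {                 // 1. WorkerPool
		synchronized (implicitJobs) {           // 2. JobManager.implicitJobs
			synchronized (lock) {               // 3. JobManager.lock
				synchronized (jobStateLock) {   // 4. InternalJob.stateChangeLock
					// ...protected work...
				}
			}
		}
	}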
Created attachment 152347 [details] Fix v6
Created attachment 152348 [details] Associated fix for UILockListener in org.eclipse.ui
Comment on attachment 150664 [details]
Test code for yield()

These are tests for yield(). They apply to all fix versions.
I think the javadoc for
Oops, somehow submitted that previous comment in the middle of typing. Please ignore the previous comment.

I think the javadoc of LockListener#shouldFork is backwards (or I am misunderstanding). The @return statement says "return <code>true</code> if this thread can block". To me, "shouldFork=true" implies that this thread *cannot* block (and therefore we must fork). It seems like it should say something like "return true if this thread cannot enter a blocking wait, and therefore any waiting must be forked into a separate thread".

On the other hand, maybe the whole method would make more sense from the client's perspective if it was called something like "isBlockingWaitAllowed" or "canBlock" rather than "shouldFork" (with the current return values inverted). Somehow "shouldFork" doesn't give the reader enough information to understand its meaning (should fork what? when? why?). I had to stare at this code for quite a while to understand it, and that may be due to the javadoc, but I think the method name could be clearer. What do you think - does this name make more sense, or am I completely missing something?
(In reply to comment #34) I chose the word "fork" to be consistent with IRunnableContext, which refers to threads as "forks". But it is a little confusing. I'm ok with canBlock(), or if you like, isBlockable(). I'm also ok with either yield() or yieldRule() for the main method. There is some precedent for yield(); see org.eclipse.emf.transaction.util.Lock#yield(). As long as the javadoc is clear, I think it doesn't matter what we choose for the name there.
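For the renamed method, something along these lines for the javadoc (the wording is only a suggestion, not the final API):

/**
 * Returns whether the calling thread is able to enter a blocking wait
 * while trying to acquire a rule or lock. If this returns false (e.g.
 * for the UI thread, which must keep servicing syncExecs), any waiting
 * is forked into a separate thread on the caller's behalf.
 *
 * @return true if the calling thread can block, false otherwise
 */
public boolean canBlock() {
	return true; // default: ordinary threads may block
}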
Created attachment 152439 [details]
Fix v7

I have made those renames. I think "yield" is a good name, but I just worry people will confuse it with Thread.yield, since Job and Thread have many similarities. I think Job#yieldRule will encourage people to read the contract of the method a little more carefully before calling it (hopefully!). I did some other minor cleanup - you seem to have a formatter option that wraps comments, leading to odd wrapping with comments that have a single word wrapped to the next line.
I just hit a deadlock when running the test suite. Thread 1 was here (JobManager.yieldRule):

	synchronized (implicitJobs) {
		synchronized (lock) {

Thread 2 was in JobManager.endRule:

	synchronized (lock) {
		implicitJobs.end(rule, false);

Note the lock order is reversed.
(In reply to comment #37) I fixed the deadlock. While designing the patch, I went through a few iterations before I was able to decide on the proper lock ordering. This one is a remnant of one of the orderings I was trying, and it slipped through. Luckily, all we need to do is remove the synchronized block from JobManager.endRule (which happens to be the code currently in HEAD, so that's nice). I went through and made sure all the locks were properly acquired. I noticed one other place where locks were flipped, and that was in the yieldRule() implementation. We must acquire implicitJobs before JobManager.lock, and the way the code was previously structured flipped some things about.

I went through and added some documentation. I added @GuardedBy() doclets (instead of annotations) that document which locks must be held when calling methods or accessing fields. This is a convention used in Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/annotations/doc/net/jcip/annotations/GuardedBy.html) and helps get across the intention of some of the design decisions. I also added some doc to ThreadJob#joinRun, which I think will help with future maintainability.

While documenting joinRun, I discovered that cancellation did not work properly when canBlock == false. That is because isCanceled() would just break the while loop. Really, it should have thrown OperationCanceledException, since the worker thread's progress monitor is private and isn't the one that was canceled. I made this change as well. I also removed the (now) unused ThreadJob.notifier field, and removed volatile from ThreadJob.isWaiting - it is actually completely GuardedBy("jobStateLock").

I'm not sure what formatter options you use, so the javadoc should probably be run through a formatter again.
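For illustration, the doclet convention looks like this on the isWaiting field mentioned above (a simplified sketch, not the actual class):

// The doclet records which monitor guards the member; tools and readers
// can then check that the monitor is held at every access.
class ThreadJobSketch {
	private final Object jobStateLock = new Object();

	/** @GuardedBy("jobStateLock") -- no longer volatile */
	private boolean isWaiting;

	void setWaiting(boolean waiting) {
		synchronized (jobStateLock) { // hold jobStateLock for every access
			isWaiting = waiting;
			jobStateLock.notifyAll();
		}
	}
}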
Created attachment 152540 [details] Fix v8
Patch v8 released in HEAD. Thanks for all your hard work Min!
(In reply to comment #40)
> Patch v8 released in HEAD. Thanks for all your hard work Min!

Woohoo!! Thanks! It was a pleasure working with you.
I get a compile error in HEAD, since RuntimeException(String, Throwable) is not available in Foundation 1.0:

	Description: The constructor RuntimeException(String, Throwable) is undefined
	Resource: WaitForRunThread.java
	Path: /org.eclipse.core.jobs/src/org/eclipse/core/internal/jobs
	Type: Java Problem
	Location: line 62

If the change in MANIFEST.MF is OK, then you should also update the 'JRE System Library' on the build path and update the project plan.
(In reply to comment #42)
> I get a compile error in HEAD, since RuntimeException(String, Throwable) is
> not available in Foundation 1.0:

That's strange; after we switched the MANIFEST.MF I stopped getting this error (comment 22). In any case, I clicked "update classpath" and released the new .classpath file to HEAD.
With great regret I've had to revert this change in HEAD. After spending the last two days and part of the weekend chasing deadlocks and other failures introduced by this change, it's clearly not ready for consumption (see blocked bugs list for details). In particular bug 295861 suggests that locks are randomly failing to be released and I just can't figure out why. Also, I have to say after spending more time debugging in this code, it adds too much extra complexity to the rule acquisition code. In particular there are potentially two more threads involved each time a rule is acquired ("internal worker" and "wait for run thread"), that just make it so difficult to understand what is going on when a thread is waiting to acquire a rule. These changes aren't even strictly related to the yield implementation which makes it hard to justify the added complexity and risk of releasing it. I will attach a patch shortly that includes the cumulative changes to date made to support this, and I have also tagged the affected projects with v20091125 before reverting so we can get back to that state if needed.
Created attachment 153116 [details]
Fix v9

This patch against HEAD includes all cumulative changes made so far to support this enhancement. This includes a couple of deadlock fixes that were made after this support was initially released.
Just an update to let you know that I'm working on a much more conservative patch. I've removed the WaitForRunThread because it was not providing any real benefits and just adding complexity. The resulting code for ThreadJob#joinRun very closely matches what was there originally. I kept the change of using InternalJob.jobStateLock as a notifier instead of a global notifier. Stay tuned for an updated patch. Also, I was not able to reproduce the problems in bug 295861. I ran the CVS tests (org.eclipse.team.tests.ccvs.core.AllTests) and the JDT JFace "CodeCompletionTest". I needed to set up a Windows-based CVS server for the team tests, and I was getting errors that seemed to be related to the Windows server. Would it be possible to run the entire Eclipse test suite using the existing test infrastructure on a "test build" or on a patch I submit to this defect?
(In reply to comment #46)
> Would it be possible to run the entire Eclipse test suite using the existing
> test infrastructure on a "test build" or on a patch I submit to this defect?

Yes, we can do that. It does tie up our test machine for several hours, so I would likely run the CVS tests a few times locally on my own machine first, but using the CVS repository on the test server. The problem seemed easiest to reproduce on Mac, but I also made it happen on XP (about 1 in 3 runs got the failure for me).
The updates I made since the last diff:

- Fixed a deadlock when yieldRule() "ping pongs" between two threads. This involved restructuring the yieldRule method a bit: needed to track which jobs are being yielded so that they can be "unblocked". New helper methods were added to increase readability. New JUnit tests were added to test for this situation.

- Added an additional restriction: yieldRule() asserts that it cannot be invoked if LockListener#canBlock returns false. yieldRule does not perform the same kind of tricks that ThreadJob does when resuming, so it will not run the event loop or service syncExecs. Perhaps this can be added in a future enhancement? Actually, I think I could potentially reuse some ThreadJob code. The more I think about this, the more I like it. I'll investigate if there is something that can be done here.

- ThreadJob.joinRun overhaul. Basically the original code now. The WaitForRunThread has been removed. It wasn't buying us anything anyway, because it was supposed to prevent contention when canBlock was false - but when canBlock is false, a whole bunch of stuff has to be done every 250ms anyway. This change allowed the same code to be re-used, the major difference being that canBlock determines whether the wait will be infinite or timed. This code should be substantially less complicated. And much better documented: both minor formatting changes and added/updated javadocs and inline docs.

- Moved findBlockedJob from ThreadJob to JobManager to facilitate reuse.

- InternalJob's result field was made volatile, since it can be written by Worker threads and read by client threads (sketched below).

Tests run: AllTests, ConcurrencyTestSuite, JdtTextTestSuite, org.eclipse.team.tests.ccvs.core.AllTests (but my Windows server has win32-specific failures).

Perhaps you could run some tests on this code that were previously failing? This is pretty close now - but I do want to see what I can come up with by reusing ThreadJob code to resume the yieldRule.
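The volatile change is the standard visibility fix; as a sketch (class and method names are illustrative):

import org.eclipse.core.runtime.IStatus;

// 'result' is written by a Worker thread when the job finishes and read
// by client threads (e.g. after join()), so it is declared volatile to
// guarantee the readers see the completed write.
class InternalJobResultSketch {
	private volatile IStatus result;

	void setResult(IStatus status) { // called by the Worker thread
		result = status;
	}

	IStatus getResult() { // called by client threads
		return result;
	}
}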
Created attachment 153555 [details] v10
Created attachment 153788 [details]
v11

This patch should be ready for your local testing. Please let me know if you find any issues and I'll fix them. I've added additional "ping pong" and yield/transfer-rule interaction tests, plus some minor test harness changes. This implementation uses ThreadJob#joinRun to resume the yield, which reduces code and just makes it fit better. It also means it can be executed from the UI thread. Fix: JobManager.find() did not return yielding jobs. I've tested with IBM 1.6 and Sun 1.4.2. I've run JobManager's AllTests as well as the ConcurrencyTestSuite. The tests for JdtTextTestSuite and "CVS UI Tests - org.eclipse.team.tests.ccvs.core.AllTests" are currently running, with no problems so far.
JUnits for CVS UI and JdtTextTestSuite passed. If you don't find any problems locally, could you please test the patch against the whole Eclipse JUnit test suite?
John, I noticed that you tagged the new methods yieldRule and canBlock as "@since 3.5". Was this intentional? I think including this in 3.5 would be great, but I didn't think API could be altered in a maintenance release.
(In reply to comment #52)
> John, I noticed that you tagged the new methods yieldRule and canBlock as
> "@since 3.5". Was this intentional? I think including this in 3.5 would be
> great, but I didn't think API could be altered in a maintenance release.

3.5 is the current bundle version. See:

http://wiki.eclipse.org/Version_Numbering#Which_version_to_use_in_javadoc_tags
(In reply to comment #53)
> (In reply to comment #52)
> > John, I noticed that you tagged the new methods yieldRule and canBlock as
> > "@since 3.5". Was this intentional? I think including this in 3.5 would be
> > great, but I didn't think API could be altered in a maintenance release.
>
> 3.5 is the current bundle version. See:
>
> http://wiki.eclipse.org/Version_Numbering#Which_version_to_use_in_javadoc_tags

Oh ok, I was confused. I hadn't realized that Eclipse plugins are completely disconnected from the main Eclipse version. Since this plugin didn't change at all in 3.5, it was still at version 3.4. What about incrementing the version to 3.6, skipping 3.5? It doesn't seem contrary to the version numbering guidelines - they don't say that the increment must be 1 each time. At least this way, the Eclipse version this plugin last changed in will align with the Eclipse version proper.
Just wanted to ping this bug after all the holidays are over. It would be great if we could run the whole Eclipse test suite on the v11 patch version.
Created attachment 158382 [details]
Fix v12

Previous patch added a second waitQueueCounter in ImplicitJobs, which was confusing. Even though they are used for separate queues, they can share the same counter, since it's just needed to ensure entries are monotonically increasing.
The latest patch was released in HEAD on Saturday and no test failures or hangs were noticed. Marking fixed.
Comment on attachment 153788 [details]
v11

This patch doesn't exactly match what was released, but it's very similar. Marking this patch so that Min gets credit for it in the IP log.