Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 378156

Summary: build cancelled by waiting (not run) job
Product: [Eclipse Project] Platform Reporter: Dani Megert <daniel_megert>
Component: ResourcesAssignee: John Arthorne <john.arthorne>
Status: VERIFIED FIXED QA Contact:
Severity: critical    
Priority: P2 CC: bishen, markus.kell.r
Version: 3.8   
Target Milestone: 3.8 M7   
Hardware: All   
OS: All   
Whiteboard:

Description Dani Megert CLA 2012-05-01 10:38:21 EDT
I20120430-2000 and 4.2-I20120430-1800.

If a job that waits for a scheduling rule gets cancelled, it causes the whole work manager to cancel further builds. This is very bad as the workspace will contain an outdated state, e.g. old class files, that don't match the source.

I've seen this recently when using EGit: In the latest EGit versions they newly pass a monitor to their indexer job (previously 'null') and auto-build stops working: the build is aborted by a getWorkManager().operationCanceled() call due to a (non-builder) job being cancelled. This looks wrong. It also looks like this call is made even though the job was not able to run because of a scheduling rule. This causes the refresh job not to build:

	public boolean shouldBuild() {
		if (hasBuildChanges) {
			if (operationCanceled)
				return Policy.buildOnCancel;
			return true;
		}
		return false;
	}


From JohnA:
Yes I think I see a bug here.. The general form:

Thread 1 owns a scheduling rule and is making some changes.
Thread 2 is waiting on a rule, and gets canceled while it is waiting. This sets WorkManager.operationCanceled=true
Thread 1 finishes its work. The build does not run because operationCanceled=true.

The operationCanceled field should only be changed by a thread that successfully acquired a rule/lock.
Comment 1 Markus Keller CLA 2012-05-01 12:46:02 EDT
The fix in EGit that probably made this problem appear is bug 373077, http://git.eclipse.org/c/egit/egit.git/commit/?id=6bff046874e3b51dcef78a65fe8a2631bdd7e3d8

The IndexDiffCacheEntry#scheduleReloadJob(String) calls Job#cancel() before restarting the job.
Comment 2 John Arthorne CLA 2012-05-01 14:19:44 EDT
(In reply to comment #0)
> Thread 1 owns a scheduling rule and is making some changes.
> Thread 2 is waiting on a rule, and gets canceled while it is waiting. This sets
> WorkManager.operationCanceled=true
> Thread 1 finishes its work. The build does not run because
> operationCanceled=true.
> 
> The operationCanceled field should only be changed by a thread that
> successfully acquired a rule/lock.

It's even more general than this. The fact that thread 2 failed to acquire a rule is not relevant here. Another case:

1) Thread 1 owns a scheduling rule and is making some changes.
2) Thread 2 gets a different scheduling rule, and starts a workspace modification:
3) Thread 2 is canceled. This sets WorkManager.operationCanceled=true
4) Thread 1 finishes its work. The build does not run because operationCanceled=true.

The operationCanceled field should only ever influence the current operation. It shouldn't influence another operation (either a parent op in the current thread, or a concurrent op in another thread)
Comment 4 John Arthorne CLA 2012-05-01 16:28:21 EDT
(In reply to comment #2)
> It's even more general than this. The fact that thread 2 failed to acquire a
> rule is not relevant here. Another case:

I realize I'm talking to myself, but wanted to record all the cases here for my own reference. The fact that there are two threads involved is also irrelevant. Here is a simpler case with a single thread:

- Start a WorkspaceRunnable
    - Start some workspace change that gets canceled
    - Catch the cancellation exception and continue
    - Perform some more workspace changes
- End workspace runnable

-> At the end, build does not occur because somewhere in the nested series of operations a cancellation happened. If the cancellation is not propagated up the stack to the top level operation, then it shouldn't prevent a build from happening. 

The fix I made is that the cancel flag in WorkManager no longer propagates to parent operations at all (in either current or other threads). If the OperationCanceledException itself gets propagated up the stack, then this flag will be set again at each operation depth anyway. If the client decides to swallow the cancellation within a nested operation and continue making changes, we should not be suppressing the build from happening.

Regression tests for both the single and multiple thread case have been released along with the fix. It took me several hours to craft a multi-thread case that consistently failed - the code pattern has to be very specific which probably explains why we never found this before (also typically another workspace change quickly comes along which kicks off the build).

Risk analysis: the only effect of this change is that WorkManager#shouldBuild returns true in more cases than it used to. The risk this introduces is that we might kick off an auto-build that wasn't needed. There are two more layers of optimization (Workspace#endOperation, and BuildManager#needsBuild) that short circuit unnecessary builds so I am comfortable with this risk.

Dani and Markus, thank you for finding and tracking this down!
Comment 5 Dani Megert CLA 2012-05-02 04:28:23 EDT
(In reply to comment #4)
> Risk analysis: the only effect of this change is that WorkManager#shouldBuild
> returns true in more cases than it used to. The risk this introduces is that > we
> might kick off an auto-build that wasn't needed. There are two more layers of
> optimization (Workspace#endOperation, and BuildManager#needsBuild) that short
> circuit unnecessary builds so I am comfortable with this risk.

Same here.

 
> Dani and Markus, thank you for finding and tracking this down!
I'm glad we did and thanks for the quick fix!

I verified using my known broken scenario that the fix works.
Comment 6 Dani Megert CLA 2012-05-23 03:35:40 EDT
This might just be coincidence and unrelated (or surfacing another bug), but lately I see full builds when starting Eclipse, even though I did not change the drop and did not have a crash.