Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 341995 - ConcurrentModificationException on commit while holding a write lock
Summary: ConcurrentModificationException on commit while holding a write lock
Status: CLOSED FIXED
Alias: None
Product: EMF
Classification: Modeling
Component: cdo.core (show other bugs)
Version: 4.0   Edit
Hardware: All All
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: Caspar D. CLA
QA Contact: Eike Stepper CLA
URL: http://www.eclipse.org/forums/index.p...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-06 05:43 EDT by Sebastian Paul CLA
Modified: 2011-09-28 06:18 EDT (History)
3 users (show)

See Also:
stepper: review+


Attachments
the tests (60.14 KB, application/x-zip-compressed)
2011-04-06 05:45 EDT, Sebastian Paul CLA
no flags Details
Test v2 (8.27 KB, patch)
2011-04-07 00:07 EDT, Eike Stepper CLA
no flags Details | Diff
Test v3 (8.03 KB, patch)
2011-04-07 00:32 EDT, Eike Stepper CLA
no flags Details | Diff
Patch (13.40 KB, patch)
2011-05-04 06:43 EDT, Caspar D. CLA
no flags Details | Diff
Patch v2 (15.19 KB, patch)
2011-05-06 00:29 EDT, Caspar D. CLA
no flags Details | Diff
Patch v3 (15.90 KB, patch)
2011-05-06 01:12 EDT, Eike Stepper CLA
no flags Details | Diff
Test v4 (8.16 KB, patch)
2011-05-06 02:11 EDT, Eike Stepper CLA
no flags Details | Diff
Patch (incremental) (2.78 KB, patch)
2011-05-13 05:31 EDT, Caspar D. CLA
no flags Details | Diff
Patch incremental v2 (2.25 KB, patch)
2011-05-13 06:18 EDT, Eike Stepper CLA
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sebastian Paul CLA 2011-04-06 05:43:31 EDT
Build Identifier: 4.0.0.v20110404-0841

This is the ticket for http://www.eclipse.org/forums/index.php?t=msg&goto=663662&S=60fa3fd91a5ad381fd1583121b4eda2e "[CDO] Locking in CDO 3 and 4".

I have changed my tests to have no external dependencies other that CDO and JUnit.

Please note that testModifyElementConcurrently shows a different behavior when ran alone. When the whole test is run and testModifyElementSequentially runs in prior, the locks are acquired in order. I suspect the setup/teardown. Maybe CDOServerApplication does not stop properly.

Reproducible: Always

Steps to Reproduce:
1. import projects
2. run the provided launch configurations
3. run testModifyElementConcurrently individually
Comment 1 Sebastian Paul CLA 2011-04-06 05:45:09 EDT
Created attachment 192625 [details]
the tests
Comment 2 Eike Stepper CLA 2011-04-07 00:04:02 EDT
Committed revision 7597:
- trunk/plugins/org.eclipse.emf.cdo.tests
Comment 3 Eike Stepper CLA 2011-04-07 00:07:27 EDT
Created attachment 192693 [details]
Test v2

I've rewritten your test for our test framework and I can reproduce the issue. Investigating...
Comment 4 Eike Stepper CLA 2011-04-07 00:32:00 EDT
Created attachment 192694 [details]
Test v3
Comment 5 Eike Stepper CLA 2011-04-07 02:44:41 EDT
Caspar and I agreed, that parts of the 3.0 behaviour should be restored, i.e. LockObjectsRequest should imply the SyncRevisions logic, but only for sessions with passiveUpdates==true.

Caspar has no exact idea when he can start on this...
Comment 6 Caspar D. CLA 2011-04-26 02:11:30 EDT
(In reply to comment #5)
> Caspar and I agreed, that parts of the 3.0 behaviour should be restored, i.e.
> LockObjectsRequest should imply the SyncRevisions logic, but only for sessions
> with passiveUpdates==true.

Hmmmm... we discussed it, that's for sure.

I remember my argument *against* restoring the "implicit refresh" behavior
was/is as follows: if an attempt to lock a single object (or a small set of
objects) requires gathering and transmitting a list of all revisions held
by the client, and the server sending back a possibly large batch of 
updated revisions, then such an apparently minor operation (locking an
object) potentially becomes rather heavyweight and slow. That doesn't seem
desirable to me.

Moreover, the whole issue can be avoided by the client app by just catching
the StaleRevisionLockException and trying again. (If the client has PU=true
then it's only a matter of time before it receives the update anyway.)

Can you summarize how you previously argued that it *is* desirable to
re-instate this behavior?

And, assuming that your case will be convincing, we need to discuss
how exactly we will realize this. The options I can think of now are:

(1) Always have a locking attempt do a full refresh. (Always makes the
    locking attempt a heavy operation.)

(2) Try a lock, then, on failure, do a full refresh, but make this
    automatic, i.e. hide this from the user app. (Makes the locking 
    attempt a light op if it succeeds right away, a heavy op if it
    fails.)
Comment 7 Caspar D. CLA 2011-04-28 06:37:59 EDT
Recapping what we agreed to on Skype:

When a client with PU=true tries to lock one or more objects that 
it holds stale revisions for, there's no point in the server sending 
back new revisions in the response stream of the LockObjectsRequest,
because those revisions will be transmitted to that client anyway
as part of a CommitTransactionRequest/Indication. So instead, we'll
just grant the locks, and tell the client which CommitTxIndication
(i.e. which timestamp) it needs to wait for.
Comment 8 Caspar D. CLA 2011-05-04 03:56:26 EDT
I'm working on this now and I notice that currently, we check first
if the client has any stale revisions, and if it does not, we continue
to attempt a lock. (And I'm reworking this to attempt the lock anyway
if the client does have stale revisions but has PU=true).

But now I wonder: does this make sense at all? The locking attempt
might block for a while, and when it finally succeeds (assuming it
does not time out), isn't it possible that by then the 'latest'
revisions have changed? In other words, is it not necessary to
re-check for stale revisions after the lock has been obtained?

Please comment.
Comment 9 Eike Stepper CLA 2011-05-04 04:43:05 EDT
I think the race condition that leads to stale revisions on the client is between the server and the client. If the locks are acquired on the server then no other client can modify the locked objects anymore and no newer revisions can occur anywhere in the network with the exception of the client that holds the lock. This makes me think that, as soon as the locking client has finally received and wired the revisions that relate to the locked objects, these revisions can not get stale anymore.

Does this answer your question or did I misunderstand it?
Comment 10 Caspar D. CLA 2011-05-04 05:00:23 EDT
(In reply to comment #9)
> Does this answer your question or did I misunderstand it?

The latter, I'm afraid.

I'm not talking about the possibility of revisions getting stale after
the locks have been acquired. Of course, there is no such possibility.

I'm talking about the possibility that a revision gets stale between
the moment that the stale-or-not check is done, and the moment the
lock is actually acquired. You see, things are done in THAT order,
in our current logic.

Example: (Note: This all happens on the server.) Client C1 wants to lock
object A, of which it hold revision v3. Server checks its latest revision,
which is v3, so it concludes that C1's revision is not stale. LockingManager
is then invoked to put the desired lock on A, but it *blocks* for some
time, presumably because another client C2 holds a lock
on it but releases it before the lock request times out. So eventually
the lock is granted to C1, and this client is NOT told that its revision v3
is not stale. But how do we know that C2 didn't commit v4 or later? We
don't, right?

I think the stale-or-not check should be performed AFTER the locks are
obtained.

I probably answered my own question... I'm just trying to make sure I'm
not making some sort of reasoning error.
Comment 11 Caspar D. CLA 2011-05-04 05:03:46 EDT
(In reply to comment #10)

Oh Lord, how I manage to confuse things with textual errors.

I wrote:

> and this client is NOT told that its revision v3
> is not stale. 

But it should have been:

> and this client is NOT told that its revision v3
> *IS* stale.
Comment 12 Caspar D. CLA 2011-05-04 06:43:47 EDT
Created attachment 194693 [details]
Patch
Comment 13 Eike Stepper CLA 2011-05-05 06:49:15 EDT
(In reply to comment #10)
> I think the stale-or-not check should be performed AFTER the locks are
> obtained.

I fully agree but I think that IS the case:

      lockObjects(); // Can take long and must come before setTimeStamp()
      monitor.worked();

      setTimeStamp(monitor.fork());

      adjustForCommit();
      monitor.worked();

      computeDirtyObjects(monitor.fork()); // Can throw StaleRevisionException!!

Why do you think that's wrong?
Comment 14 Caspar D. CLA 2011-05-05 23:03:45 EDT
(In reply to comment #13)

> Why do you think that's wrong?

The code you quoted is from TransactionCommitContext#write. I don't see
how that is relevant here.

What I was asking about, is the ordering of things in 
LockObjectsIndication#indicating, which I am reworking at your request
to restore the old "lock-implies-automatic-refresh" behavior (or whatever 
you want to call it).

The current implementation of THAT method, checks for stale revisions
first and then attempts a lock. This is wrong because the revisions might
still become stale while the locking attempt is suspended.
Comment 15 Eike Stepper CLA 2011-05-06 00:22:35 EDT
Argh, now I see! Actually both methods participate in the overall locking mechanism. But you're right, the stale-checks should be done after the lock attempt in both cases.
Comment 16 Caspar D. CLA 2011-05-06 00:29:33 EDT
Created attachment 194899 [details]
Patch v2
Comment 17 Eike Stepper CLA 2011-05-06 01:12:39 EDT
Created attachment 194900 [details]
Patch v3

Just a renaming and some reformats.
Comment 18 Caspar D. CLA 2011-05-06 01:26:40 EDT
Committed revision 7654.
Comment 19 Eike Stepper CLA 2011-05-06 01:59:59 EDT
Committed revision 7655:
- trunk/plugins/org.eclipse.emf.cdo
Comment 20 Eike Stepper CLA 2011-05-06 02:11:30 EDT
Created attachment 194904 [details]
Test v4

Hi Sebastian,

I'd like to add your tests to our suite, now that they are passing. Unfortunately they take quite long to complete and most of the time is spent waiting. Do you think you can remove the unnecessary wait periods, possibly by using smarter locks or latches? Please just reopen this bugzilla if/when you're done. Thanks ;-)
Comment 21 Sebastian Paul CLA 2011-05-06 03:55:22 EDT
I have been moved to a different project, so there's currently no time. However, I don't see waits in the attached test. Not sure where so much time is consumed exactly.
Comment 22 Caspar D. CLA 2011-05-13 01:54:41 EDT
Implementation of new solution (have clients with PU=true wait for the 
update when server detects a stale revision during lock request) is flawed.

It will deadlock in all cases where the client really has to wait, i.e. 
where it finds that it has not yet received the update with the timestamp 
that the server is telling it to wait for.

Why? Because CDOViewImpl.lockObjects is declared synchronized and so hold's
the view's monitor lock, and CDOViewImpl.doInvalidate (which updates the
timestamp upon receiving a CommitNotification) is also declared synchronized.
The lockObjects requests suspends itself by calling lastUpdateTimeLock.wait,
so it releases the lastUpdateTimeLock -- but not the CDOViewImpl's monitor
lock.
Comment 23 Caspar D. CLA 2011-05-13 05:31:47 EDT
Created attachment 195570 [details]
Patch (incremental)
Comment 24 Caspar D. CLA 2011-05-13 05:32:27 EDT
Latest patch fixes the problem, but I haven't written a test yet to 
verify the correct behavior carefully. Will do soon.
Comment 25 Eike Stepper CLA 2011-05-13 05:43:28 EDT
For completeness sake:

[10:07:39] Eike Stepper: i think we can remove view.lastUpdateTimestampLock in favour of the general sync's in the view
[10:07:54] Caspar/Jasper: ah good, that's the simplest and cleanest solution.
[10:08:08] Eike Stepper: lastUpdateTimestampLock could be a relict from before-sync-everywhere times
[10:08:08] Caspar/Jasper: so lastUpdateTimestampLock.wait() becomes this.wait()
[10:08:39] Caspar/Jasper: and  lastUpdateTimestampLock.notifyAll() becomes this.notifyAll()
[10:08:46] Caspar/Jasper: inside CDOViewImpl, that is.
[10:08:51] Eike Stepper: i think we should try it and write some tests (if we don't have any already)
[10:08:55] Caspar/Jasper: ok.
Comment 26 Eike Stepper CLA 2011-05-13 06:18:49 EDT
Created attachment 195574 [details]
Patch incremental v2

Decreased the sync scope in waitForUpdate().
Comment 27 Caspar D. CLA 2011-05-13 06:25:34 EDT
Committed revision 7696.
Comment 28 Caspar D. CLA 2011-05-13 06:25:54 EDT
Reopening... I should still add a testcase.
Comment 29 Eike Stepper CLA 2011-05-13 06:27:37 EDT
(In reply to comment #28)
> Reopening... I should still add a testcase.

If that/those pass/es in all standard suites you may commit it without a review ;-)
Comment 30 Caspar D. CLA 2011-05-23 05:19:45 EDT
Committed revision 7820.
Comment 31 Eike Stepper CLA 2011-06-23 03:40:12 EDT
Available in R20110608-1407
Comment 32 Eike Stepper CLA 2011-09-22 11:00:31 EDT
Caspar, in trunk I'm occasionally getting this:

junit.framework.AssertionFailedError: timeTaken == 1999 expected:<true> but was:<false>
	at junit.framework.Assert.fail(Assert.java:47)
	at junit.framework.Assert.failNotEquals(Assert.java:283)
	at junit.framework.Assert.assertEquals(Assert.java:64)
	at junit.framework.Assert.assertEquals(Assert.java:143)
	at org.eclipse.emf.cdo.tests.bugzilla.Bugzilla_341995_Test.test(Bugzilla_341995_Test.java:62)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at org.eclipse.net4j.util.tests.AbstractOMTest.runBare(AbstractOMTest.java:221)
	at org.eclipse.emf.cdo.tests.config.impl.ConfigTest.runBare(ConfigTest.java:526)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at org.eclipse.net4j.util.tests.AbstractOMTest.run(AbstractOMTest.java:267)
	at junit.framework.TestSuite.runTest(TestSuite.java:243)
	at org.eclipse.emf.cdo.tests.config.impl.ConfigTestSuite$TestWrapper.runTest(ConfigTestSuite.java:119)
	at junit.framework.TestSuite.run(TestSuite.java:238)
	at junit.framework.TestSuite.runTest(TestSuite.java:243)
	at junit.framework.TestSuite.run(TestSuite.java:238)
	at junit.framework.TestSuite.runTest(TestSuite.java:243)
	at junit.framework.TestSuite.run(TestSuite.java:238)
	at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Any idea why?

The test logic doesn't look very robust.
Comment 33 Eike Stepper CLA 2011-09-22 11:02:24 EDT
Committed revision 9204
Comment 34 Eike Stepper CLA 2011-09-28 06:18:51 EDT
Committed revision 9411 to trunk (test timing enhancement)