| Summary: | ConcurrentModificationException on commit while holding a write lock | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Modeling] EMF | Reporter: | Sebastian Paul <sebastian.paul.ext> | ||||||||||||||||||||
| Component: | cdo.core | Assignee: | Caspar D. <caspar_d> | ||||||||||||||||||||
| Status: | CLOSED FIXED | QA Contact: | Eike Stepper <stepper> | ||||||||||||||||||||
| Severity: | normal | ||||||||||||||||||||||
| Priority: | P3 | CC: | caspar_d, saulius.tvarijonas, sebastian.paul.ext | ||||||||||||||||||||
| Version: | 4.0 | Flags: | stepper:
review+
|
||||||||||||||||||||
| Target Milestone: | --- | ||||||||||||||||||||||
| Hardware: | All | ||||||||||||||||||||||
| OS: | All | ||||||||||||||||||||||
| URL: | http://www.eclipse.org/forums/index.php?t=msg&goto=663662&S=60fa3fd91a5ad381fd1583121b4eda2e | ||||||||||||||||||||||
| Whiteboard: | |||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||
|
Description
Sebastian Paul
Created attachment 192625 [details]
the tests
Committed revision 7597: - trunk/plugins/org.eclipse.emf.cdo.tests Created attachment 192693 [details]
Test v2
I've rewritten your test for our test framework and I can reproduce the issue. Investigating...
Created attachment 192694 [details]
Test v3
Caspar and I agreed, that parts of the 3.0 behaviour should be restored, i.e. LockObjectsRequest should imply the SyncRevisions logic, but only for sessions with passiveUpdates==true. Caspar has no exact idea when he can start on this... (In reply to comment #5) > Caspar and I agreed, that parts of the 3.0 behaviour should be restored, i.e. > LockObjectsRequest should imply the SyncRevisions logic, but only for sessions > with passiveUpdates==true. Hmmmm... we discussed it, that's for sure. I remember my argument *against* restoring the "implicit refresh" behavior was/is as follows: if an attempt to lock a single object (or a small set of objects) requires gathering and transmitting a list of all revisions held by the client, and the server sending back a possibly large batch of updated revisions, then such an apparently minor operation (locking an object) potentially becomes rather heavyweight and slow. That doesn't seem desirable to me. Moreover, the whole issue can be avoided by the client app by just catching the StaleRevisionLockException and trying again. (If the client has PU=true then it's only a matter of time before it receives the update anyway.) Can you summarize how you previously argued that it *is* desirable to re-instate this behavior? And, assuming that your case will be convincing, we need to discuss how exactly we will realize this. The options I can think of now are: (1) Always have a locking attempt do a full refresh. (Always makes the locking attempt a heavy operation.) (2) Try a lock, then, on failure, do a full refresh, but make this automatic, i.e. hide this from the user app. (Makes the locking attempt a light op if it succeeds right away, a heavy op if it fails.) Recapping what we agreed to on Skype: When a client with PU=true tries to lock one or more objects that it holds stale revisions for, there's no point in the server sending back new revisions in the response stream of the LockObjectsRequest, because those revisions will be transmitted to that client anyway as part of a CommitTransactionRequest/Indication. So instead, we'll just grant the locks, and tell the client which CommitTxIndication (i.e. which timestamp) it needs to wait for. I'm working on this now and I notice that currently, we check first if the client has any stale revisions, and if it does not, we continue to attempt a lock. (And I'm reworking this to attempt the lock anyway if the client does have stale revisions but has PU=true). But now I wonder: does this make sense at all? The locking attempt might block for a while, and when it finally succeeds (assuming it does not time out), isn't it possible that by then the 'latest' revisions have changed? In other words, is it not necessary to re-check for stale revisions after the lock has been obtained? Please comment. I think the race condition that leads to stale revisions on the client is between the server and the client. If the locks are acquired on the server then no other client can modify the locked objects anymore and no newer revisions can occur anywhere in the network with the exception of the client that holds the lock. This makes me think that, as soon as the locking client has finally received and wired the revisions that relate to the locked objects, these revisions can not get stale anymore. Does this answer your question or did I misunderstand it? (In reply to comment #9) > Does this answer your question or did I misunderstand it? The latter, I'm afraid. I'm not talking about the possibility of revisions getting stale after the locks have been acquired. Of course, there is no such possibility. I'm talking about the possibility that a revision gets stale between the moment that the stale-or-not check is done, and the moment the lock is actually acquired. You see, things are done in THAT order, in our current logic. Example: (Note: This all happens on the server.) Client C1 wants to lock object A, of which it hold revision v3. Server checks its latest revision, which is v3, so it concludes that C1's revision is not stale. LockingManager is then invoked to put the desired lock on A, but it *blocks* for some time, presumably because another client C2 holds a lock on it but releases it before the lock request times out. So eventually the lock is granted to C1, and this client is NOT told that its revision v3 is not stale. But how do we know that C2 didn't commit v4 or later? We don't, right? I think the stale-or-not check should be performed AFTER the locks are obtained. I probably answered my own question... I'm just trying to make sure I'm not making some sort of reasoning error. (In reply to comment #10) Oh Lord, how I manage to confuse things with textual errors. I wrote: > and this client is NOT told that its revision v3 > is not stale. But it should have been: > and this client is NOT told that its revision v3 > *IS* stale. Created attachment 194693 [details]
Patch
(In reply to comment #10) > I think the stale-or-not check should be performed AFTER the locks are > obtained. I fully agree but I think that IS the case: lockObjects(); // Can take long and must come before setTimeStamp() monitor.worked(); setTimeStamp(monitor.fork()); adjustForCommit(); monitor.worked(); computeDirtyObjects(monitor.fork()); // Can throw StaleRevisionException!! Why do you think that's wrong? (In reply to comment #13) > Why do you think that's wrong? The code you quoted is from TransactionCommitContext#write. I don't see how that is relevant here. What I was asking about, is the ordering of things in LockObjectsIndication#indicating, which I am reworking at your request to restore the old "lock-implies-automatic-refresh" behavior (or whatever you want to call it). The current implementation of THAT method, checks for stale revisions first and then attempts a lock. This is wrong because the revisions might still become stale while the locking attempt is suspended. Argh, now I see! Actually both methods participate in the overall locking mechanism. But you're right, the stale-checks should be done after the lock attempt in both cases. Created attachment 194899 [details]
Patch v2
Created attachment 194900 [details]
Patch v3
Just a renaming and some reformats.
Committed revision 7654. Committed revision 7655: - trunk/plugins/org.eclipse.emf.cdo Created attachment 194904 [details]
Test v4
Hi Sebastian,
I'd like to add your tests to our suite, now that they are passing. Unfortunately they take quite long to complete and most of the time is spent waiting. Do you think you can remove the unnecessary wait periods, possibly by using smarter locks or latches? Please just reopen this bugzilla if/when you're done. Thanks ;-)
I have been moved to a different project, so there's currently no time. However, I don't see waits in the attached test. Not sure where so much time is consumed exactly. Implementation of new solution (have clients with PU=true wait for the update when server detects a stale revision during lock request) is flawed. It will deadlock in all cases where the client really has to wait, i.e. where it finds that it has not yet received the update with the timestamp that the server is telling it to wait for. Why? Because CDOViewImpl.lockObjects is declared synchronized and so hold's the view's monitor lock, and CDOViewImpl.doInvalidate (which updates the timestamp upon receiving a CommitNotification) is also declared synchronized. The lockObjects requests suspends itself by calling lastUpdateTimeLock.wait, so it releases the lastUpdateTimeLock -- but not the CDOViewImpl's monitor lock. Created attachment 195570 [details]
Patch (incremental)
Latest patch fixes the problem, but I haven't written a test yet to verify the correct behavior carefully. Will do soon. For completeness sake: [10:07:39] Eike Stepper: i think we can remove view.lastUpdateTimestampLock in favour of the general sync's in the view [10:07:54] Caspar/Jasper: ah good, that's the simplest and cleanest solution. [10:08:08] Eike Stepper: lastUpdateTimestampLock could be a relict from before-sync-everywhere times [10:08:08] Caspar/Jasper: so lastUpdateTimestampLock.wait() becomes this.wait() [10:08:39] Caspar/Jasper: and lastUpdateTimestampLock.notifyAll() becomes this.notifyAll() [10:08:46] Caspar/Jasper: inside CDOViewImpl, that is. [10:08:51] Eike Stepper: i think we should try it and write some tests (if we don't have any already) [10:08:55] Caspar/Jasper: ok. Created attachment 195574 [details]
Patch incremental v2
Decreased the sync scope in waitForUpdate().
Committed revision 7696. Reopening... I should still add a testcase. (In reply to comment #28) > Reopening... I should still add a testcase. If that/those pass/es in all standard suites you may commit it without a review ;-) Committed revision 7820. Available in R20110608-1407 Caspar, in trunk I'm occasionally getting this: junit.framework.AssertionFailedError: timeTaken == 1999 expected:<true> but was:<false> at junit.framework.Assert.fail(Assert.java:47) at junit.framework.Assert.failNotEquals(Assert.java:283) at junit.framework.Assert.assertEquals(Assert.java:64) at junit.framework.Assert.assertEquals(Assert.java:143) at org.eclipse.emf.cdo.tests.bugzilla.Bugzilla_341995_Test.test(Bugzilla_341995_Test.java:62) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:168) at org.eclipse.net4j.util.tests.AbstractOMTest.runBare(AbstractOMTest.java:221) at org.eclipse.emf.cdo.tests.config.impl.ConfigTest.runBare(ConfigTest.java:526) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at org.eclipse.net4j.util.tests.AbstractOMTest.run(AbstractOMTest.java:267) at junit.framework.TestSuite.runTest(TestSuite.java:243) at org.eclipse.emf.cdo.tests.config.impl.ConfigTestSuite$TestWrapper.runTest(ConfigTestSuite.java:119) at junit.framework.TestSuite.run(TestSuite.java:238) at junit.framework.TestSuite.runTest(TestSuite.java:243) at junit.framework.TestSuite.run(TestSuite.java:238) at junit.framework.TestSuite.runTest(TestSuite.java:243) at junit.framework.TestSuite.run(TestSuite.java:238) at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Any idea why? The test logic doesn't look very robust. Committed revision 9204 Committed revision 9411 to trunk (test timing enhancement) |