Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 341319

Summary: NPE on commit after connection lost when using ReconnectingSessionConfiguration
Product: [Modeling] EMF Reporter: Sebastian Paul <sebastian.paul.ext>
Component: cdo.net4jAssignee: Caspar D. <caspar_d>
Status: CLOSED WORKSFORME QA Contact:
Severity: normal    
Priority: P3 CC: caspar_d
Version: 4.0   
Target Milestone: ---   
Hardware: All   
OS: All   
URL: http://www.eclipse.org/forums/index.php?t=msg&goto=660535&S=1290accdc8c19348495aa655b82c3309
Whiteboard:
Attachments:
Description Flags
The test which shows the problem
none
Fixed test (succeeds)
none
Fixed tests (Caspar's version) none

Description Sebastian Paul CLA 2011-03-30 04:21:33 EDT
Build Identifier: 4.0.0.v20110303-0827

When using ReconnectingSessionConfiguration, a NullPointerException is thrown when committing a transaction after the connection was interrupted.
For this, I have attached a JUnit test.

The test "testConnectionLostWithRecovery" succeeds.
But the test "testTransactionWithLostConnectionAndRecovery" fails:
org.eclipse.emf.cdo.util.CommitException: java.lang.NullPointerException
at
org.eclipse.emf.internal.cdo.transaction.CDOTransactionImpl.commit(CDOTransactionImpl.java:1067)
at
org.eclipse.emf.internal.cdo.transaction.CDOTransactionImpl.commit(CDOTransactionImpl.java:1073)
at ITCdoConnection.testTransactionWithLostConnectionAndRecovery
...
Caused by: java.lang.NullPointerException
    at org.eclipse.emf.internal.cdo.session.DelegatingSessionProtocol.getSession(DelegatingSessionProtocol.java:117)
    at org.eclipse.emf.internal.cdo.session.DelegatingSessionProtocol.handleException(DelegatingSessionProtocol.java:799)
    at org.eclipse.emf.internal.cdo.session.DelegatingSessionProtocol.commitTransaction(DelegatingSessionProtocol.java:270)
    at org.eclipse.emf.internal.cdo.transaction.CDOSingleTransactionStrategyImpl.commit(CDOSingleTransactionStrategyImpl.java:80)
    at org.eclipse.emf.internal.cdo.transaction.CDOTransactionImpl.commit(CDOTransactionImpl.java:1053)
    ... 40 more

The NPE comes from DelegatingSessionProtocol#commitTransaction.
Obviously, the call ReconnectingCDOSessionImpl(CDOSessionImpl).setSessionProtocol(CDOSessionProtocol) line: 301 is done too late (runs in a different thread).
When I set a breakpoint, the other thread has the chance to set the delegate and my test succeeds. 

Reproducible: Always

Steps to Reproduce:
1. Import the attached project
2. Run ITCdoConnection as JUnit test
3. All tests should succeed except from testTransactionWithLostConnectionAndRecovery
Comment 1 Sebastian Paul CLA 2011-03-30 04:22:19 EDT
Created attachment 192165 [details]
The test which shows the problem
Comment 2 Caspar D. CLA 2011-03-30 05:11:07 EDT
I haven't yet been able to reproduce the problem with your testcases.
This may be because we're on different platforms. I'm on Linux. Are
you on Windows?

In any case, what's important is that it's the user app's responsibility
to avoid making calls on the protocol (or calls on the session that will
trigger calls on the protocol) while connection recovery is in
progress. The app must register a listener with the session object, and
listener for CDOSessionRecoveryEvent.Type.STARTED and FINISHED, and not
make any calls between those events.

From a quick look at your testcases, I get the impression there's no
code that does this, and so my guess is that 

The reason things work this way, is that otherwise a RecoveringSession
would have to override every session method to wrap the super invocation
with something like 

  if (weHappenToBeDoingRecoveryRightNow())
  {
    // Throw ex? Or suspend caller?
  }
  else
  {
    super.doTheNormalThing();
  }

Please comment further if I misunderstand.
Comment 3 Caspar D. CLA 2011-03-30 05:13:42 EDT
I see I left a sentence unfinished. What I meant to say was:

> From a quick look at your testcases, I get the impression there's no
> code that does this, and so my guess is that 

your testcode proceeds prematurely with another attempt to commit.
Comment 4 Sebastian Paul CLA 2011-03-30 08:54:16 EDT
Thanks for the quick response.
I have fixed the test, now it succeeds.
I am still wondering why reconnect takes so long (10sec). I tried tweaking this via the session configuration, but without any effect.
Comment 5 Sebastian Paul CLA 2011-03-30 08:55:04 EDT
Created attachment 192181 [details]
Fixed test (succeeds)
Comment 6 Caspar D. CLA 2011-03-31 02:30:21 EDT
Created attachment 192248 [details]
Fixed tests (Caspar's version)

I had to make a few changes to your tests to make them work
reliably in my environment. Specifically, I replaced the 
boolean 'recoveryInProgress' with a latch, and added a 
check to see whether the real proxy process is actually still
running when we attempt to stop it. The tests all pass on my
platform as well now.
Comment 7 Caspar D. CLA 2011-03-31 02:49:33 EDT
(In reply to comment #4)
> I am still wondering why reconnect takes so long (10sec). I tried
> tweaking this via the session configuration, but without any effect.

I've looked into this a bit, and I get the impression that the actual
recovery is very fast, but that it is slowed down somehow by the 
restart of the proxy process.

What makes me say this, is that if I set a breakpoint in your #notifyEvent
handler, and keep it suspended there for 10 seconds or so, while allowing
the main thread to restart the proxy, then when I resume the thread later
the recovery is almost instantaneous. That is, the recovery FINISHED event
is fired less than a second later.

We can discuss further, but I'll flag this issue as RESOLVED/WORKSFORME
because the original issue as described in the summary has been addressed
without changing the recovery logic.
Comment 8 Eike Stepper CLA 2012-09-21 06:51:41 EDT
Closing.