Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 328595

Summary: TimeoutException during raw replication
Product: [Modeling] EMF Reporter: Erwin Betschart <erwin>
Component: cdo.coreAssignee: Eike Stepper <stepper>
Status: CLOSED FIXED QA Contact: Eike Stepper <stepper>
Severity: major    
Priority: P3 CC: caspar_d, pascal.lehmann, stepper
Version: 4.0   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Whiteboard: offline-02
Bug Depends on: 78181    
Bug Blocks:    
Attachments:
Description Flags
patch
stepper: iplog+
Patch
none
Patch v3 none

Description Erwin Betschart CLA 2010-10-25 09:02:31 EDT
Build Identifier: 4.0

The following exception can occur during raw replication:

!ENTRY org.eclipse.net4j.util 4 0 2010-10-25 14:28:12.124
!MESSAGE java.util.concurrent.TimeoutException
!STACK 0
org.eclipse.emf.cdo.common.util.TransportException: java.util.concurrent.TimeoutException
 at org.eclipse.emf.cdo.internal.net4j.protocol.CDOClientProtocol.send(CDOClientProtocol.java:368)
 at org.eclipse.emf.cdo.internal.net4j.protocol.CDOClientProtocol.replicateRepositoryRaw(CDOClientProtocol.java:311)
 at org.eclipse.emf.cdo.internal.server.syncing.RepositorySynchronizer$ReplicateRunnable.run(RepositorySynchronizer.java:440)
 at org.eclipse.net4j.util.concurrent.QueueRunner.work(QueueRunner.java:26)
 at org.eclipse.net4j.util.concurrent.QueueRunner.work(QueueRunner.java:1)
 at org.eclipse.net4j.util.concurrent.QueueWorker.work(QueueWorker.java:75)
 at org.eclipse.net4j.util.concurrent.Worker$WorkerThread.run(Worker.java:188)
Caused by: java.util.concurrent.TimeoutException
 at org.eclipse.net4j.util.io.IOTimeoutException.createTimeoutException(IOTimeoutException.java:46)
 at org.eclipse.net4j.signal.Signal.runSync(Signal.java:244)
 at org.eclipse.net4j.signal.SignalProtocol.startSignal(SignalProtocol.java:445)
 at org.eclipse.net4j.signal.RequestWithConfirmation.doSend(RequestWithConfirmation.java:87)
 at org.eclipse.net4j.signal.RequestWithConfirmation.send(RequestWithConfirmation.java:73)
 at org.eclipse.emf.cdo.internal.net4j.protocol.CDOClientProtocol.send(CDOClientProtocol.java:360)
 ... 6 more

Occurs only when big databases are replicated (around 300MB H2). 


Reproducible: Always

Steps to Reproduce:
1. Do a lot of commits on huge lists (-> The DATATYPE_LIST db table gets very big)
2. Connect a client with raw replication enabled
3. -> Raw replication fails.
Comment 1 Eike Stepper CLA 2010-10-26 01:29:16 EDT
Hi Erwin,

I noticed that the stack trace does not match the HEAD sources. Have you applied modifications or is it just not the hottest code?

If read (response) timeouts happen due to large data volumes being collected in the server, shouldn't you just increase the timeout value? You would do that in your CDOSessionConfigurationFactory where you'd create a CDOSessionConfiguration that sets the larger timeout value at the end of openSession like this:

((org.eclipse.emf.cdo.net4j.CDOSession)session).options().getProtocol().setTimeout(millis);

If you can confirm that this works for you I'd like to close this bug as WORKSFORME, but I'm waiting until you've tried it ;-)
Comment 2 Eike Stepper CLA 2010-10-26 02:42:54 EDT
It seems that setting the protocol timeout for a Net4j CDOSession becomes easier with bug 278181:

    sessionsConfiguration.setSignalTimeout(millis);
Comment 3 Erwin Betschart CLA 2010-10-27 05:00:49 EDT
(In reply to comment #1)

I'm working on HEAD but not latest.

I though about increasing the session timeout. This would fix the issue for some time depending how big the timeout is set.
The commit request (CommitTransactionRequest) is a RequestWithMonitoring which is (I think) the reason that the timeout of the session must not be increased when working with huge commits.
The ReplicateRepositoryRawRequest should also extend the RequestWithMonitoring because it will always be bigger that the biggest commit ever made.
Comment 4 Eike Stepper CLA 2010-10-27 05:33:37 EDT
> The ReplicateRepositoryRawRequest should also extend the RequestWithMonitoring
> because it will always be bigger that the biggest commit ever made.

The idea is sound but looking at the code I have some doubts that it would reliably prevent the read timeout from happening on the client.

The CommitTransactionRequest is fundamentally different because it carries the major payload on the way *to* the repo, whereas the RepositoryReplicationRequest does it on the way *back*. From this point of view the active monitoring in the repo does not seem to replace the more general read timeouts but rather implement an *additional* mechanism.

Thinking...
Comment 5 Eike Stepper CLA 2010-10-27 05:36:38 EDT
BTW. you can see the problem in RequestWithConfirmation:

  @Override
  void doExecute(BufferInputStream in, BufferOutputStream out) throws Exception
  {
    doOutput(out);
    doInput(in);
  }

A client that expects a confirmation starts immediately to read from the response stream after the last request byte has been sent. At that time also the protocol's read timeout starts to become effective.
Comment 6 Eike Stepper CLA 2010-10-27 08:44:30 EDT
My last two comments turned out to be wrong: The RequestWithMonitoring mechanism *does* already reset the normal read timout ;-)
Comment 7 Erwin Betschart CLA 2010-10-27 11:31:20 EDT
Created attachment 181846 [details]
patch

Implementation with monitoring.
Comment 8 Erwin Betschart CLA 2010-10-28 11:26:47 EDT
Created attachment 181960 [details]
Patch

Previous fix didn't work.

-> Monitor is now started.
-> Breaks the progress monitor on client side.

Therefore not yet final fix.
Comment 9 Eike Stepper CLA 2010-10-30 03:50:55 EDT
Created attachment 182096 [details]
Patch v3

I've consolidated the monitored signals in CDO.
Comment 10 Erwin Betschart CLA 2010-11-03 07:03:16 EDT
(In reply to comment #9)
> Created an attachment (id=182096) [details]
> Patch v3
> 
> I've consolidated the monitored signals in CDO.

Tested your patch. Works well except the progress on the repository synchronizer 
is broken.

Always starts at 96%.
Comment 11 Erwin Betschart CLA 2010-11-15 06:42:54 EST
Opened bug 330227 for the problem mentioned in comment 10.
Therefore this bug can be closed.
Comment 12 Eike Stepper CLA 2010-11-15 10:27:26 EST
Thank you, Erwin!

Committed to HEAD
Comment 13 Eike Stepper CLA 2011-06-23 03:42:15 EDT
Available in R20110608-1407