Bug 262875 - java.nio.BufferUnderFlowException
Status: CLOSED FIXED
Alias: None
Product: EMF
Classification: Modeling
Component: cdo.net4j
Version: 2.0
Hardware: PC Windows Vista
Importance: P3 normal
Target Milestone: ---
Assignee: Andre Dietisheim CLA
QA Contact: Eike Stepper CLA
Duplicates: 295989
Depends on: 268169
Blocks: 561532
Reported: 2009-01-29 03:54 EST by David Bonneau CLA
Modified: 2020-03-28 04:04 EDT (History)
5 users

See Also:
stepper: review+


Attachments
JUnit test case (8.46 KB, application/octet-stream)
2009-01-29 03:54 EST, David Bonneau CLA
stepper: iplog+
patch that solves BufferUnderflowException (8.55 KB, patch)
2010-01-08 04:01 EST, Andre Dietisheim CLA
Patch v2 (against 2.0!) (5.28 KB, patch)
2010-01-09 04:52 EST, Eike Stepper CLA
Patch v3 (against 2.0) (15.08 KB, patch)
2010-01-11 18:46 EST, Andre Dietisheim CLA
Patch v4 - ready to be committed (18.41 KB, patch)
2010-01-15 08:57 EST, Eike Stepper CLA

Description David Bonneau CLA 2009-01-29 03:54:42 EST
Created attachment 124126 [details]
JUnit test case

Build ID: I20080617-2000

Steps To Reproduce:
I'm using Net4j 2.0.0M4 and CDO 2.0.0M4 (with a JVMConnector) and I get the following exception:

java.nio.BufferUnderflowException
	at java.nio.Buffer.nextGetIndex(Buffer.java:404)
	at java.nio.DirectByteBuffer.getShort(DirectByteBuffer.java:529)
	at org.eclipse.net4j.signal.SignalProtocol.handleBuffer(SignalProtocol.java:215)
	at org.eclipse.spi.net4j.Channel$ReceiverWork.run(Channel.java:305)
	at org.eclipse.net4j.util.concurrent.QueueRunner.work(QueueRunner.java:26)
	at org.eclipse.net4j.util.concurrent.QueueRunner.work(QueueRunner.java:1)
	at org.eclipse.net4j.util.concurrent.QueueWorker.work(QueueWorker.java:64)
	at org.eclipse.net4j.util.concurrent.Worker$WorkerThread.run(Worker.java:154)

More information:
The exception occurs when I commit a CDO transaction: CommitTransactionRequest.send(). I tried to find out in the Net4j code what's going wrong and came up with the following explanation:

In the Signal.doOutput() method, data is written to the BufferOutputStream. When the underlying buffer of the BufferOutputStream is full, its flush() method is called. The buffer is then handed to a channel and sent to the server.

The problem happens if the last write of the Signal.doOutput() method exactly fills the current buffer. The buffer is flushed:
- Server: SignalProtocol.handleBuffer: OK.
- Client: SignalProtocol.handleBuffer: OK (response from CommitTransactionIndication); the current signal is removed from the signals map of the SignalProtocol.
- Client: Now the Signal.doOutput() method calls BufferOutputStream.flushWithEOS():
ensureBuffer creates a new, empty buffer, which is flushed and sent.
- Server: SignalProtocol.handleBuffer() does not find the signal in the signals map, so it tries to read the signal ID from the buffer ... and ... BufferUnderflowException!
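
The final step can be reproduced in isolation with plain java.nio (a minimal sketch, not net4j code): a relative getShort() on a buffer with no remaining bytes throws exactly this exception.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Minimal sketch (not net4j code): reading a 2-byte signal ID from a buffer
// that carries no payload fails as described above.
public class UnderflowDemo
{
  public static boolean readSignalIdFails()
  {
    ByteBuffer emptyPayload = ByteBuffer.allocate(0); // models the empty EOS buffer

    try
    {
      emptyPayload.getShort(); // handleBuffer() would read a short signal ID here
      return false;
    }
    catch (BufferUnderflowException expected)
    {
      return true; // no bytes remain, so the relative get fails
    }
  }

  public static void main(String[] args)
  {
    System.out.println(readSignalIdFails()); // prints "true"
  }
}
```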

I hope you understand what I tried to explain :)

To solve the problem, I changed the code of the flushWithEOS() method:

  public void flushWithEOS() throws IOException
  {
    if (currentBuffer != null)
    {
      currentBuffer.setEOS(true);
      flush();
    }
  }

But what about the EOS state? How does the protocol manage it? What happens if we never send the EOS flag? Shouldn't we refrain from removing the signal from the signals map before we receive an EOS?

I attach a small JUnit test that reproduces the bug.
Comment 1 Eike Stepper CLA 2009-01-29 04:15:16 EST
David, Thanks for the analysis! I'll put it on the top of my list when today's webinar is done ;-)
Comment 2 Eike Stepper CLA 2009-03-31 03:34:10 EDT
André, This one should be caught by our new white box testing effort...
Comment 3 Andre Dietisheim CLA 2009-12-30 11:54:46 EST
Added Bugzilla262875_Test to org.eclipse.net4j.tests
Comment 4 Andre Dietisheim CLA 2009-12-30 11:56:04 EST
I really think this bug's the same as 

https://bugs.eclipse.org/bugs/show_bug.cgi?id=295989

By luck, we have a test case here. Thanks David!!
Comment 5 Eike Stepper CLA 2009-12-31 01:29:08 EST
Hi David, please confirm that you're the only author of the contributed code and that you're applying the EPL to it. If we don't get your confirmation on this I have to remove the code from CVS. Sorry, this is the Foundation's IP process and André didn't know it, yet.
Comment 6 Eike Stepper CLA 2009-12-31 02:18:37 EST
Changed Worker.java so that test case actually fails on BufferUnderflowException.
Comment 7 David Bonneau CLA 2010-01-04 03:55:13 EST
(In reply to comment #5)
> Hi David, please confirm that you're the only author of the contributed code
> and that you're applying the EPL to it. If we don't get your confirmation on
> this I have to remove the code from CVS. Sorry, this is the Foundation's IP
> process and André didn't know it, yet.

Hi, I'm the only author of the contributed code and I'm applying the EPL to it :)
Comment 8 Andre Dietisheim CLA 2010-01-06 18:52:51 EST
thanks David, 
the work you did analyzing the cause was tremendous and completely correct. It was a great help! 
Your unit test exercises the behavior when the data to send (payload plus channelID metadata) exactly matches the buffer capacity (4096 bytes). In this case a second, empty buffer (containing nothing but the channelID) is sent over the wire. The server tries to create an actor for it and therefore reads the correlationID, which is not present. I found exactly what you described. In my eyes, the root cause is that BufferOutputStream#write() flushes a buffer once there's no room for further data in it. In your unit test, flush() occurs twice:

- once in BufferOutputStream#write (after writing the last payload byte, the write operation flushes the buffer, because there's no further space in it)

- a second time in Signal#doOutput when it flushes with EOS (out.flushWithEOS()) 

IMHO it is sufficient to change BufferOutputStream#write so that it flushes before writing further data (and not after writing data). 
I extracted the exception throwing into #ensureBuffer, and in my patch BufferOutputStream#write looks like this:

  @SuppressWarnings("deprecation")
  @Override
  public void write(int b) throws IOException
  {
    throwExceptionOnError();
    ensureBuffer();
    ByteBuffer buffer = currentBuffer.getByteBuffer();
    if (!buffer.hasRemaining())
    { // there's no room left for further bytes in the buffer
      flush();
    }

    if (TRACER.isEnabled())
    {
      TRACER.trace("--> " + HexUtil.formatByte(b) //$NON-NLS-1$
          + (b >= 32 ? " " + Character.toString((char)b) : "")); //$NON-NLS-1$ //$NON-NLS-2$
    }

    buffer.put((byte)b);
  }

I'm reluctant to provide it yet, because SecurityTest currently fails in my workspace (though it fails even without my patch). Furthermore, I found a further case that is not handled correctly:

increase the payload to more than 4096 bytes. Now you get further errors with a non-matching correlationID. I'd like to fix this before releasing my patch.
Comment 9 Eike Stepper CLA 2010-01-07 03:06:50 EST
André, didn't we agree that this should be fixed on the receiving party side?
Comment 10 Andre Dietisheim CLA 2010-01-07 04:04:34 EST
(In reply to comment #9)
> André, didn't we agree that this should be fixed on the receiving party side?

absolutely. In my eyes there are 2 cases here:

1.) The payload exactly matches the size of the buffer. The current implementation sends a first buffer and then an empty second one. IMHO it shouldn't send the empty one. That's what I fixed so far.

2.) The payload exceeds the available buffer size. In this case 2 or more buffers are sent over the wire, and the receiver has to be fixed to handle that.
Comment 11 Eike Stepper CLA 2010-01-07 07:39:18 EST
I don't understand. Maybe with a patch of your changes in the sender it would be easier.

AFAIU the problem is only that in very rare cases (but eventually predictable per application protocol) the last payload bytes of a signal fill up a full buffer so that the buffer is flushed/sent. When the signal wants to set the EOS bit thereafter there is no used buffer available and a new one is provided. The EOS bit is set and it is sent. Without further payload! And this buffer without payload causes the problem at the receiver. Correct?
Comment 12 Andre Dietisheim CLA 2010-01-07 15:13:51 EST
> 
> AFAIU the problem is only that in very rare cases (but eventually predictable
> per application protocol) the last payload bytes of a signal fill up a full
> buffer so that the buffer is flushed/sent. When the signal wants to set the EOS
> bit thereafter there is no used buffer available and a new one is provided. The
> EOS bit is set and it is sent. Without further payload! And this buffer without
> payload causes the problem at the receiver. Correct?

Exactly. IMHO this is a flaw on its own. I believe that the second, empty buffer mustn't be sent, and that this can be prevented if flushing occurs not after but before writing payload bytes.

A second flaw consists of bugs in the receiver that occur if the payload is split into several buffers. This occurs only if there's a delay between the send operations (have a look at the test case; it sleeps the thread for a few milliseconds after writing).
Comment 13 Andre Dietisheim CLA 2010-01-08 04:01:46 EST
Created attachment 155579 [details]
patch that solves BufferUnderflowException

this patch fixes the bug that causes the BufferUnderflowException. As mentioned above, it happens if the data to send (channelID + correlationID + payload) exactly matches the buffer size (4096).
The code changes are in BufferOutputStream#write. I also refactored SignalProtocol (split methods, inserted comments) but did not change any functionality. I added comments to Channel, too, but did not touch any code there.

I'll now move on to the second bug (mentioned above), which occurs if the data to send is larger than the buffer size.
Comment 14 Eike Stepper CLA 2010-01-08 05:12:26 EST
Comment on attachment 155579 [details]
patch that solves BufferUnderflowException

Hi Andre, Please request reviews for the bugzilla, not for single patches. And enter the email of the requested reviewer behind the review? flag. Thx ;-)
Comment 15 Andre Dietisheim CLA 2010-01-08 16:48:47 EST
I could spend a few hours investigating the second bug (where the data is larger than 1 buffer and gets split into 2 buffers that are sent in sequence).


I changed the Indication to read the data that it gets:

      protected void indicating(ExtendedDataInputStream in) throws Exception
      {
        int i = 0;
        while (in.read() != -1)
        {
          i++;
        }
        System.out.println(i + " bytes read.");
      }

I noticed 2 strange things so far:

1.) the buffer where the BufferInputStream gets its data from releases its contents far too early. Only a fraction of the data may be read in the indication. 

2.) Furthermore, the buffer gets released too early. BufferInputStream#read
checks whether there are more bytes to be read from the underlying buffer:

    if (!currentBuffer.getByteBuffer().hasRemaining())
    {
      currentBuffer.release();
      currentBuffer = null;
    }

Occasionally, currentBuffer.getByteBuffer() complains about a bad state of the byteBuffer ([RELEASED] instead of [GETTING] or [PUTTING]). This sounds like a concurrency issue to me.

I wondered whether there are any net4j indications that successfully implement reading large portions of data. I checked our current tests and found the following comment in TransportTest#testTextStreaming:

  /**
   * TODO Fails occasionally ;-( Caused by: java.lang.IllegalStateException: selectionKey == null
   */

Sounds related to me. 

Am I missing something? Any further thoughts or comments?
Comment 16 Andre Dietisheim CLA 2010-01-08 17:46:57 EST
Reading my comment a second time, I found several unfinished sentences and repetitions. It might once more just be too late, and time to quit hacking.

Thinking twice about the issues (buffers get released too early), I get the impression that this might be because the test does not properly wait until the request/response sequence is completely processed (I might just have to insert another latch/lock between both parts). Let me look at it again as soon as I feel rested ;-)

Nevertheless any comments related to TransportTest and transferring large data portions in net4j are highly welcome.
Comment 17 Eike Stepper CLA 2010-01-09 04:30:05 EST
Search for usages of org.eclipse.net4j.tests.data.HugeData. These tests use ChannelXyzStreams directly, i.e. without Signal/SignalProtocol. Maybe the "huge data" is not huge enough. You could multiply the test data if needed...
Comment 18 Eike Stepper CLA 2010-01-09 04:52:12 EST
This is a 2.0 bugzilla but the original test case and changes to Worker.java have been committed to HEAD. Where shall I apply the latest patch with the fix?
Comment 19 Eike Stepper CLA 2010-01-09 04:52:54 EST
Created attachment 155652 [details]
Patch v2 (against 2.0!)
Comment 20 Andre Dietisheim CLA 2010-01-10 16:58:52 EST
(In reply to comment #18)
> This is a 2.0 bugzilla but the original test case and changes to Worker.java
> have been committed to HEAD. Where shall I apply the latest patch with the fix?

sorry for that, I overlooked it. Shouldn't we backport the changes in Worker and the test to 2.0? I could do that tomorrow and make sure the test succeeds in 2.0.
Furthermore, we need the fix in 3.0, too.
Comment 21 Eike Stepper CLA 2010-01-11 03:49:19 EST
Yes, please ;-)
Comment 22 Andre Dietisheim CLA 2010-01-11 18:46:00 EST
Created attachment 155808 [details]
Patch v3 (against 2.0)

patch that includes:
- modifications to Worker 
- Bugzilla262875_Test
- Fixes to org.eclipse.net4j (BufferOutputStream)
Comment 23 Andre Dietisheim CLA 2010-01-13 17:59:13 EST
To sum up:

bug 1:
BufferUnderflowException if the data to transfer (channelID + correlationID + payload) exactly matches the buffer size:

SOLVED

bug 2:
IllegalArgumentException if the data to transfer does not fit into 1 buffer:

INVALID

The reason this occurs is that the Indication (in the current unit test) does not read the bytes it was sent.

If the data to transfer is larger than the buffer capacity, it is split up into 2 or more buffers. The first buffer contains channelID + correlationID + [bytes that fit into the current buffer]. The second (and any further) buffer contains channelID + [remaining bytes]. If the Indication does not read from the underlying BufferInputStream, there's no way for the BufferInputStream to know that there are 2 or more buffers it should process. The first buffer is polled; the second one stays in the queue. The SignalProtocol catches then up and gets the queued buffer. It thinks it's a new signal, tries to read the signalID from it, gets a data byte instead (the signalID is not present!) and throws an IllegalArgumentException. 
A further failing case arises when the data happens to match a legal signalID, so no IllegalArgumentException is thrown.

Explained in depth, the BufferInputStream waits for incoming data in the read method. The read method stops polling for buffers only if 

a) EOS
b) timeout

If the Indication never reads data, the BufferInputStream has no chance to poll all the buffers it was sent (until EOS or timeout).

We might think about reading the BufferInputStream in Signal#doInput to ensure all buffers are consumed even if the indication does not read its buffers. This way SignalProtocol would never mistake a second buffer sent to the same Indication for a new signal.
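
The drain idea suggested in comment 23 can be sketched as follows (illustrative only; drainRemaining is a hypothetical helper, not net4j API): after the indication-specific reading, any bytes left unread are consumed until end of stream, so leftover buffers of this signal cannot be mistaken for a new one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of draining an indication's unread bytes (hypothetical helper,
// not net4j API). A real implementation would drain the BufferInputStream
// in Signal#doInput after the indication returns.
public class DrainDemo
{
  public static int drainRemaining(InputStream in) throws IOException
  {
    int drained = 0;
    while (in.read() != -1) // consume until end of stream
    {
      drained++;
    }

    return drained;
  }

  public static void main(String[] args) throws IOException
  {
    // Simulate an indication that read only the first buffer's worth (4096
    // bytes) of a 5000-byte payload; the drain consumes the remaining 904.
    InputStream in = new ByteArrayInputStream(new byte[5000]);
    in.skip(4096);
    System.out.println(drainRemaining(in) + " bytes drained"); // 904 bytes drained
  }
}
```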
Comment 24 Eike Stepper CLA 2010-01-14 05:46:35 EST
(In reply to comment #23)
> To sum up:
> 
> bug 1:
> BufferUnderflowException if the data to transfer (channledID + correlationID +
> payload) exactly matches the buffer size:
> 
> SOLVED

Excellent.

> bug 2:
> IllegalArgumentException if the data to transfer does not fit into 1 buffer:
> 
> INVALID

What does that mean?

> The reason that this occurs is that the Indication (in the current unit test)
> does not read the bytes it was sent.
> 
> If the data to transfer is larger than the buffer capacity, it is split up into
> 2 or more buffers. The first buffer contains channeldID + correlationID +
> [bytes that fit into the current buffer]. The second (and any further buffer)
> contains channelID + [remaining bytes]. 

Every buffer has a 4-byte "buffer header" that carries information for the channel multiplexer: short channelID + short bufferSize. This information is used to read the payload (A) and hand it to the appropriate channel.

If the receiveHandler of the channel is a SignalProtocol, the payload (A) is further split into a 4-byte "signal header" and a signal payload (B). *Each* signal buffer contains this header (the int correlationID), not only the first in a sequence! This correlationID is used to pass the buffer to the appropriate signal *instance*.

For the first buffer of an Indication there is no existing *instance* of a signal. That's why only this first buffer additionally contains the signalID, which is really a signal *type* ID. It's used to initially create a signal instance and register it with the signals map under the correlationID for future lookup.
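
The framing described here can be sketched as follows (field widths taken from this comment and comment 24's lookup rule; an illustrative model, not net4j source):

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Illustrative model of the framing described above (not net4j source):
// buffer header = short channelID + short bufferSize, signal header = int
// correlationID, and only the first buffer of a new signal adds a short
// signalID (the signal *type* ID).
public class SignalFraming
{
  public static ByteBuffer encode(short channelID, int correlationID, Short signalID, byte[] payload)
  {
    int size = 4 + 4 + (signalID != null ? 2 : 0) + payload.length;
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putShort(channelID);
    buf.putShort((short)(size - 4)); // bufferSize: bytes after the buffer header
    buf.putInt(correlationID);
    if (signalID != null)
    {
      buf.putShort(signalID); // present in the first buffer only
    }

    buf.put(payload);
    buf.flip();
    return buf;
  }

  // Receiver side: a buffer starts a *new* signal only if the correlationID
  // is positive (an Indication) and no instance is registered for it yet.
  public static boolean startsNewSignal(ByteBuffer buf, Map<Integer, Object> signals)
  {
    buf.getShort(); // channelID
    buf.getShort(); // bufferSize
    int correlationID = buf.getInt();
    return correlationID > 0 && signals.get(-correlationID) == null;
  }
}
```

If startsNewSignal() is true but the signalID bytes are missing (the empty-EOS-buffer case), the receiver's attempt to read them is exactly what fails.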

> If the Indication does not read from
> the underlying BufferInputStream, there's no way for the BufferInputStream to
> know, that there are 2 or more buffers it should process.

That confuses me. Why would the indication not read while more buffers exist? This sounds like an application protocol bug.

> The first buffer is
> polled, the second one stays in the queue. 

Ok.

> The SignalProtocol catches then up

What is that?

> and gets the queued buffer. It thinks its a new signal, tries to read the
> signalID from it, gets a data byte instead (signalID is not present!) and
> throws an IllegalArgumentException. 

It only thinks there's a *new* signal if both are true:
a) correlationID > 0 (meaning that the incoming buffer belongs to an Indication as opposed to a Confirmation of a request)
b) signals.get(-correlationID) == null (meaning that no signal instance already exists for this incoming Indication data).

In this case the SignalProtocol expects a signalID to be present following the correlationID, so that a new signal instance can be created and registered. If that's not the case a severe problem will result.

> A further failing case raises up here where the data matches a legal signalID
> and there's no IllegalArgumentException thrown.

Well, the reason is the same. Only the phenomenon is different, by accident.

> Explained in depth, the BufferInputStream waits for incoming data in the read
> method. 

Actually in ensureBuffer().

> The read method stops polling for buffers only if 
> 
> a) EOS
> b) timeout
> 
> If the Indication never reads data, the BufferInputStream has no chance to poll
> all the buffers (until EOS or timeout) it was sent.

I don't follow. Didn't you fix the sender side so that EOS is no longer sent in a buffer with no payload?

> We might think about reading the BufferInputStream in Signal#doInput to ensure
> all buffers are consumed if the indication does not read its buffers. This way
> SignalProtocol would never read a buffer that was a second buffer sent to the
> same Indication.
Comment 25 Andre Dietisheim CLA 2010-01-14 06:25:45 EST
we decided that bug 2 is due to errors in the application protocol; it's therefore INVALID. Bug 1 is SOLVED, so we can now commit the fix to HEAD and close this bugzilla. We'll open a CDO 3.0 bugzilla and apply the fix to CDO 3.0, too (the fix is already available in this bugzilla: it's the obsolete patch from 2010-01-08).
Comment 26 Eike Stepper CLA 2010-01-15 08:57:25 EST
Created attachment 156220 [details]
Patch v4 - ready to be committed

Well done, guys!
Comment 27 Andre Dietisheim CLA 2010-01-16 05:10:45 EST
*** Bug 295989 has been marked as a duplicate of this bug. ***
Comment 28 Andre Dietisheim CLA 2010-01-16 05:11:24 EST
Committed to R2_0_maintenance