Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 359184

Summary: SftpException when doing SftpFileService.getFile
Product: [Tools] Target Management Reporter: George Dinkov <go6etobate>
Component: RSEAssignee: Martin Oberhuber <mober.at+eclipse>
Status: RESOLVED FIXED QA Contact: Martin Oberhuber <mober.at+eclipse>
Severity: normal    
Priority: P3 CC: adietish, go6etobate, john.arthorne, petar.petrov, ymnk
Version: 3.3Keywords: helpwanted
Target Milestone: 3.4 RC1   
Hardware: PC   
OS: Windows 7   
Whiteboard:
Bug Depends on: 360663    
Bug Blocks:    
Attachments:
Description Flags
Binary JSch-0.1.45 bundle none

Description George Dinkov CLA 2011-09-28 03:03:54 EDT
Build Identifier: 20110615-0604

I'm using the SSH file service in my connection and EFS for the test. When doing fileStore.fetchInfo(EFS.NONE, null) from multiple threads started with some delay in between some of the calls fail. This also happens when I do openInputStream. The exception is thrown in SftpFileService.getFile (and getInputStream) and happens occasionally in the provided test case.
It can also be reproduced consistently using breakpoints and starting 3 threads which call fileStore.fetchInfo(EFS.NONE, null) as described.
As it seems after ChannelSftp.sendLSTAT is called another thread closes the com.jcraft.jsch.Channel.MyPipedInputStream and this causes the error.
I haven't encountered this problem when testing manually but it happens pretty often when running SWTBot tests.
I'm using the latest release of RSE. The version of RSE SSH Services is 3.0.300.v201106011538-7A3F9wAkF7BB7FAKECRM

Reproducible: Always

Steps to Reproduce:
I'm using a test as the following:
final IFileStore fileStore = ...// a file on the remote machine
		for(int i = 0; i < 1000000; ++i)
		{
			System.out.println("Iteration " + i);
			Runnable testThread = new Runnable()
			{
				@Override
				public void run()
				{
					try
					{
						fileStore.fetchInfo(EFS.NONE, null);
					}
					catch(CoreException e)
					{
						e.printStackTrace();
					}
					catch(Throwable t)
					{
						t.printStackTrace();
					}
				}
			};
			Thread th = new Thread(testThread);
			th.start();
			sleep(500);
		}
Usually after a few hundred iterations I get the following error (output with tracing enabled):

Iteration 258
2011-09-27 11:51:15.648 | Thread-269 | SftpFileService.getFile: /home/georged/workspace/OneFileProject/Program1.c
2011-09-27 11:51:15.650 | Thread-269 | SftpFileService.getFile failed: 4: 
org.eclipse.core.runtime.CoreException: Operation failed. File system input or output error: rse://10.123.11.61/home/georged/workspace/OneFileProject/Program1.c
	at org.eclipse.rse.internal.efs.RSEFileStoreImpl.rethrowCoreException(RSEFileStoreImpl.java:637)
	at org.eclipse.rse.internal.efs.RSEFileStoreImpl.getRemoteFileObject(RSEFileStoreImpl.java:357)
	at org.eclipse.rse.internal.efs.RSEFileStoreImpl.fetchInfo(RSEFileStoreImpl.java:550)
	at org.eclipse.rse.internal.efs.RSEFileStore.fetchInfo(RSEFileStore.java:305)
	at com.mytest.eclipse.tests.TestTest$1.run(TestTest.java:70)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.eclipse.rse.services.files.RemoteFileIOException: Operation failed. File system input or output error
	at org.eclipse.rse.internal.services.ssh.files.SftpFileService.makeSystemMessageException(SftpFileService.java:594)
	at org.eclipse.rse.internal.services.ssh.files.SftpFileService.getFile(SftpFileService.java:643)
	at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.getFile(FileServiceSubSystem.java:384)
	at org.eclipse.rse.subsystems.files.core.servicesubsystem.FileServiceSubSystem.getRemoteFileObject(FileServiceSubSystem.java:209)
	at org.eclipse.rse.internal.efs.RSEFileStoreImpl.getRemoteFileObject(RSEFileStoreImpl.java:355)

And when breaking on the thrown exception, the original exception looks like this:
SftpException  4: 
cause: java.io.IOException: inputstream is closed
no message or stack trace

Another way I could reproduce the problem consistently is by using breakpoints. 
1. I have a breakpoint on the call of fileStore.fetchInfo in my run() method.
2. I start 3 threads only.
3. When they all stop at the breakpoint I put a breakpoint on the first line of org.eclipse.rse.internal.services.ssh.files.SftpFileService.getFile(String, String, IProgressMonitor)
4. I continue all threads and the first doesn't stop on the breakpoint and finishes, the second and the third stop.
5. Now put a breakpoint on ChannelSftp._lstat or step to there using one of the two threads
6. Also put a breakpoint on java.io.PipedInputStream.close()
6. Continue to and step over the sendLSTAT call - this will be OK on the first thread, continue the thread
7. Do the same with the second thread - now when you step over sendLSTAT the breakpoint on PipedInputStream.close is hit and if this thread continues first the stream gets closed and an exception is thrown in the original thread.

It seems that the problem may be in jcraft or in the way it is used in RSE. Also this may be related to https://bugs.eclipse.org/bugs/show_bug.cgi?id=268463 but I'm not doing connect/disconnect.

Best regards,
George
Comment 1 Martin Oberhuber CLA 2011-09-28 03:49:16 EDT
What version of Eclipse Platform and JSch are you using?

You might want to try JSch 0.1.44 from http://www.jcraft.com/ -- just copying the JAR into eclipse/dropins should be sufficient to pick up the new version after a restart.
Comment 2 Atsuhiko Yamanaka CLA 2011-09-28 04:24:32 EDT
(In reply to comment #0)
>         for(int i = 0; i < 1000000; ++i)
>         {
>             System.out.println("Iteration " + i);
>             Runnable testThread = new Runnable()
>             {
>                 @Override
>                 public void run()
>                 {
>                     try
>                     {
>                         fileStore.fetchInfo(EFS.NONE, null);
>                     }
...
>             };
>             Thread th = new Thread(testThread);
>             th.start();
>             sleep(500);

It seems there is a possibility to invoke fileStore.fetchInfo simultaneously,
and jsch's ChannelSftp#get may be also invoked simultaneously.
ChannelSftp has not expected such a usage.

How about wrapping it with synchronized(fileStore)?
Comment 3 Martin Oberhuber CLA 2011-09-28 06:20:16 EDT
Thanks for the comment and suggestion ... but calling out from a synchronized block into unknown code is a recipe for deadlock, so that's a no-go from my point of view.

Also, the lower levels of RSE / Sftp may cache information so they need to care for synchronizing access in the lower levels. In fact SftpFileService uses fDirChannelMutex to ensure that directory access (like ls below fetchInfo) is serialized.

Something else must go wrong here.
Comment 4 George Dinkov CLA 2011-09-28 06:41:29 EDT
My target platform is Version: 3.7.0, Build id: I20110613-1736

JSch is 0.1.41.v201101211617
I upgraded to 0.1.42 with the same result.
I'll try to find a plugin for 0.1.44 (the jar from http://www.jcraft.com/ isn't a plugin and the new version doesn't get picked up when i put it in dropins)
Comment 5 George Dinkov CLA 2011-09-28 06:46:31 EDT
In SftpFileService there is a fDirChannelMutex  which handles the synchronization. Anyway I tried to wrap the call to fetchInfo with synchronized(fileStore) and I got the same error.
Comment 7 George Dinkov CLA 2011-09-28 11:35:46 EDT
I tried it using Jsch 0.1.44 and the same thing happened.
Comment 8 Martin Oberhuber CLA 2011-09-29 08:46:50 EDT
I can reproduce the problem on Win7 / Java1.6.0_21 client with an RHEL4.8 host. The non-debugging test runs into the exception at iteration 62. The debugging test is 100% reproducable.

What happens is that JSch Session.run() runs into this exception:
   java.io.IOException: Read end dead
in line 1297 (channel.write....), the exception is swallowed and the Channel closed:
  try{channel.disconnect();}catch(Exception ee){}
break;

This seems to be timing dependent. The test with 3 Threads + breakpoints runs fine when they are all left running. However, when I single-step the 1st Thread over sendLSTAT so some time passes after sending the LSTAT command and before reading the LSTAT response, the issue occurs.

Atsuhiko, does this ring a bell for you? Can it happen that the LSTAT response is consumed by somebody else when the reader is too slow to read ... and that kills the proper sequence of packets?

To me, this appears to be a Threading problem in JSch since from RSE point of view all requests are perfectly serialized over a single channel (in the 3 Thread / breakpoint exercise).

As a hint for debugging / fixing this: In the Plug-in Development Perspective, switch to the Plugins view, select "com.jcraft.jsch" and "Import as Source Project". Now you can modify JSch (eg for adding debug printouts etc). The error notified for Compression.java is not relevant since jzlib compression is optional.
Comment 9 Martin Oberhuber CLA 2011-09-29 08:48:39 EDT
PS I tested with JSch-0.1.44.
Comment 10 Martin Oberhuber CLA 2011-09-29 09:12:19 EDT
Ok, I think I know what's going on here.

The Sun/Oracle implementation of PipedInputStream somehow assumes that once a Thread has read from a PipedInputStream, that same Thread will be reading the rest from that Stream too.... at least until somebody else reads.

As of JDK 1.5, in java.io.PipedInputStream#read() we have
    readSide = Thread.currentThread();
and when somebody writes to that pipe it first checks whether the reader Thread is still alive.

Now in this testcase, we have a single Channel only.
- Thread 1 performs LSTAT on the Channel (it writes and reads, so it's 
  identified as the Thread associated with the Channel).
- Now Thread 1 terminates.
- Thread 2 takes over, and writes to the Channel.

Now when the remote side is _faster_ sending the packet response than the Thread can start reading, the PipedInputStream will still think that Thread 1 is the expected receiver. But Thread 1 has already terminated, so the "Read end dead" exception is thrown.

This is relatively unlikely in real world... but with eg 60 Threads running in parallel, it can happen that the LSTAT Thread is interrupted just after sending its LSTAT request and not scheduled again before a response comes from the remote.

I'm not exactly sure how to best fix this... options include
 (a) Not using Sun's PipedInputStream implementation, but a different one 
     that doesn't check what Thread reads or writes
 (b) Encapsulate the sendLSTAT and related read with some synchronization,
     such that the outer Session can't try to send a response before the
     Channel is ready to read (doesn't seem very elegant to me)
 (c) Doing a fake / dummy read on the Channel's input stream before sending
     the write [or taking some other precaution that the correct Thread is
     associated with the Channel]

I'm most leaning towards option a sine I assume that a very similar problem may occur in other cases as well, when multiple Threads use the same Channel in a serialized fashion. Or perhaps JSch elects to forbid such use of its Channel, then we'd need to dream something up on the TM/RSE side.
Comment 11 Martin Oberhuber CLA 2011-09-29 09:21:49 EDT
One alternative implementation of PipedInputStream is used in the tm.terminal:

http://dev.eclipse.org/viewcvs/viewvc.cgi/org.eclipse.tm.core/terminal/org.eclipse.tm.terminal/src/org/eclipse/tm/internal/terminal/textcanvas/PipedInputStream.java?root=Tools_Project&view=log

we originally adopted this implementation because we saw performance problems
with Sun's implementation in Java 1.4:
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4404700
the code has been rewritten since (Java 1.5 "Tiger") so the Performance problem
isn't there any more, but as we see here the extroneous Threading check still
hurts.

I'll leave it to Atsuhiko at this point for comments.
Comment 12 Martin Oberhuber CLA 2011-09-30 04:02:10 EDT
I think that independent of any improvements in JSch and/or PipedInputStream for allowing more Threads to access a single Channel, we should fix this in RSE by making sure that only a single Thread accesses each channel.

Conceptually, the fix is not hard though quite some code will be affected: We already serialize Channel access through fDirChannelMutex, now we'll also need to make sure that all requests will be executed by a single Thread (executor pattern).

Essentially, this boilerplate code in SftpFileService:

    if (fDirChannelMutex.waitForLock(monitor, fDirChannelTimeout)) {
        try {
            /* operation */
        } finally {
            fDirChannelMutex.release();
        }
    } else {
        throw new SystemLockTimeoutException(Activator.PLUGIN_ID);
    }

needs to be replaced by the following everywhere:

    fDirChannelExecutor.syncExec(monitor, fDirChannelTimeout, new Runnable() {
        public void run() {
            /* operation */
        }
    });

with the fDirChannelExecutor being a new class that serializes incoming requests and delegates to a single Thread for execution; progress monitoring and cancellation semantics should remain as they are today for the fDirChannelMutex. Executor implementations already exist (for instance in java.util.concurrent: Executors#newSingleThreadExecutor() - I'm fine with increasing the SSH BREE to J2SE-1.5 at this time to get access to that package).

Reducing the boilerplate code will make our code also more readable and safer, and centralizes the Threading / serialization logic in a single place that's re-usable.

It doesn't seem to be very hard, does anybody want to give it a try and contribute a patch? - I've set the helpwanted keyword and will be reviewing / accepting contributions.
Comment 13 Atsuhiko Yamanaka CLA 2011-09-30 04:31:36 EDT
(In reply to comment #10)
> Ok, I think I know what's going on here.
> 
... 
> I'm not exactly sure how to best fix this... options include
>  (a) Not using Sun's PipedInputStream implementation, but a different one 
>      that doesn't check what Thread reads or writes
>  (b) Encapsulate the sendLSTAT and related read with some synchronization,
>      such that the outer Session can't try to send a response before the
>      Channel is ready to read (doesn't seem very elegant to me)
>  (c) Doing a fake / dummy read on the Channel's input stream before sending
>      the write [or taking some other precaution that the correct Thread is
>      associated with the Channel]

Thank for investigating the problem.
I'll check our code, Sun's code etc., and think what should be done.
Comment 14 Martin Oberhuber CLA 2011-09-30 04:49:24 EDT
(In reply to comment #13)
> I'll check our code, Sun's code etc., and think what should be done.

I've done some checks and thoughts already ... from my POV the Sun PipedInputStream just doesn't support multiple Threads accessing a Stream, any work around that would be awkward. That leaves 2 options for JSch, essentially...

(a) document that only a single Thread may operate on a Channel,
    (we'll fix RSE here under this assumption), or
(b) ship your own PipedInputStream implementation.

I thought a bit about why Sun might have chosen to check whether the Reader Thread is still alive in the Pipe... maybe they did it to ensure that a writer doesn't wait forever if the buffer is full and the Reader is dead.

So IMO there's a little bit of risked involved with option b and allowing arbitrary Threads on a Stream, since any Thread failing to read properly could mess up the entire sequence of events ... and if you're not careful it could perhaps even block the entire Session.

On the other hand, maybe some clients of JSch implicitly expect that they're allowed to use multiple Threads today so they might not like limiting the API.
Your call...
Comment 15 Atsuhiko Yamanaka CLA 2011-09-30 10:25:29 EDT
(In reply to comment #13)
> (In reply to comment #10)
> > Ok, I think I know what's going on here.
> > 
> ... 
> > I'm not exactly sure how to best fix this... options include
> >  (a) Not using Sun's PipedInputStream implementation, but a different one 
> >      that doesn't check what Thread reads or writes
> >  (b) Encapsulate the sendLSTAT and related read with some synchronization,
> >      such that the outer Session can't try to send a response before the
> >      Channel is ready to read (doesn't seem very elegant to me)
> >  (c) Doing a fake / dummy read on the Channel's input stream before sending
> >      the write [or taking some other precaution that the correct Thread is
> >      associated with the Channel]
> 
> Thank for investigating the problem.
> I'll check our code, Sun's code etc., and think what should be done.

Could you try the following version?
  http://www.jcraft.com/jsch/jsch-0.1.45.bug359184.zip
It will continue to use Sun's PipedInputStream, 
but try to update PipedInputStream#readSide before reading it.
Comment 16 Martin Oberhuber CLA 2011-10-03 10:43:08 EDT
Created attachment 204451 [details]
Binary JSch-0.1.45 bundle

Thanks Atsuhiko, the fix seems to work - my testcase is running 1200 iterations now without running into the exception.

I have attached a binary build of the patched JSch for others to try. Just download attached into your (Eclipse)/dropins directory and launch.
Comment 17 George Dinkov CLA 2011-10-04 05:50:07 EDT
Thanks Atsuhiko, Martin,
the fix seems to work here too. It's now 140000 iterations without any exceptions.
Comment 18 Martin Oberhuber CLA 2011-10-04 08:27:25 EDT
Thanks George.

Is the solution with attached JSch plugin acceptable for you, such that I could close this bug as "NOT_ECLIPSE" (since the problem is in JSch / PipedInputStream)?

Or do you need a solution in RSE which would work with the older JSch as well?
Comment 19 George Dinkov CLA 2011-10-04 09:45:27 EDT
I think it's acceptable for me. I can open another bug to upgrade JSch version used in Eclipse.
Comment 20 Martin Oberhuber CLA 2011-10-12 04:33:39 EDT
Atsuhiko, many thanks for the quick turnaround (and the clever way fixing this - directly manipulating the protected PipedInputStream's buffer didn't occur to me).

When do you expect JSch-0.1.45 to be released? If we want the Juno Eclipse train to pick up JSch-0.1.45 I think we should start the process of IP review and integrating into Eclipse Platform fairly soon.
Comment 21 Atsuhiko Yamanaka CLA 2011-10-14 02:46:47 EDT
(In reply to comment #20)
> Atsuhiko, many thanks for the quick turnaround (and the clever way fixing this
> - directly manipulating the protected PipedInputStream's buffer didn't occur to
> me).

I have a concern about that change, because it has depended on the internal
implementation of Sun's PipedInputStream.  I have confirmed that change does not
cause the problem for GCJ(GNU's classpath).  Does somebody try it on IBM's JDK,
or has java.io package of IBM's JDK come from Sun?

> When do you expect JSch-0.1.45 to be released? If we want the Juno Eclipse
> train to pick up JSch-0.1.45 I think we should start the process of IP review
> and integrating into Eclipse Platform fairly soon.

I agree it should be released as soon as possible.
I'll try it.
Comment 22 Martin Oberhuber CLA 2012-05-11 05:46:38 EDT
JSch-0.1.46 is now in Orbit so this can be fixed now.
Comment 23 Martin Oberhuber CLA 2012-05-22 14:46:21 EDT
I'm going to claim this is fixed by the JSch-0.1.46 bundle in Platform
Eclipse 3.8rc1 / 4.2rc1 .

I'm currently not planning re-distributing the newer JSch ourselves.