Bug 348635

Summary: Resource manager connections closing for no apparent reason
Product: [Tools] PTP
Reporter: Greg Watson <g.watson>
Component: Remote Tools
Assignee: Greg Watson <g.watson>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: arossi
Version: 5.0
Target Milestone: 5.0.4
Hardware: Macintosh
OS: Mac OS X
Whiteboard:

Description Greg Watson CLA 2011-06-07 15:46:21 EDT
For some reason, resource manager connections are closing spuriously, causing the RM to transition to the stopped state.

Worse, Al has observed this on RMs that use completely independent connections, which points to some underlying problem in the remote tools framework.

This is extremely difficult to debug as it only occurs sporadically, but nevertheless has a major impact on usability.
Comment 1 Albert L. Rossi CLA 2011-06-07 16:05:14 EDT
I have observed the following behavior.

If I open three new RMs which have associated LML drivers (remote polling done once a minute), and follow this sequence of actions, I get the reported error below.

1. RM1 on Connection1 to host A
2. RM2 on Connection2 to host B
3. RM3 on Connection2 to host B

1. RM1 submits a job to the scheduler.
2. RM2 submits a job to the scheduler.
3. RM3 submits an "interactive" pseudoTerminal session in order to use the -I option on the scheduler.

When the job on RM1 completes, I activate a context menu action which attempts to stream the output file to the console.  The connection to the head node of A seems to be slow, or the head node has a high load, so this stream can take up to a minute even though the amount being streamed is not considerable (50-100k).  Because this overlaps with the next polling from the other two drivers, they attempt to run in the meantime.  But when the streaming completes (successfully; the output arrives in its entirety at the console), one of the other two RMs, usually the interactive one, reports a "pipe closed" and the RM goes into error mode.  It can be terminated, but the thread doing the polling needs to be canceled separately using the progress bar.

It looks as though something at the connection level is being shared that shouldn't be.

Al
Comment 2 Albert L. Rossi CLA 2011-06-08 16:56:43 EDT
Running on separate connections greatly reduces the chance of this happening, but it still does seem to pop up occasionally.

My gut feeling is that there is some non-thread-safe behavior in the underlying JSch classes.

Al
Comment 3 Greg Watson CLA 2011-06-08 16:58:15 EDT
This issue looks like Remote Tools connections are not thread safe for remote processes. I'm not sure if this is in Remote Tools or JSch. The workaround is to use separate connections for each RM. Lowering severity to normal and moving to Remote Tools. This will need to be fixed in an update release.
Comment 4 Albert L. Rossi CLA 2011-06-08 16:58:34 EDT
Also, the original diagnosis that it had to do with the long I/O read was a red herring.  It will happen on the boot of the LML driver if you share a connection between RMs.

Al
Comment 6 Greg Watson CLA 2011-10-18 12:04:12 EDT
I've added a fix that synchronizes the call to channel.connect(), which seems to fix the use of RemoteProcessBuilder from multiple threads. My tests are not showing any issues with this fix, but because this is a concurrency issue there may still be problems that manifest from time to time.
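
For illustration only, here is a minimal sketch of what serializing the connect step on a shared connection can look like. It is not the actual Remote Tools change and does not use JSch: SharedConnection, connectChannel() and ConnectDemo are hypothetical names, and the counter merely stands in for whatever per-connection state is touched during channel.connect().

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for one connection shared by several resource managers. */
final class SharedConnection {
    private final Object connectLock = new Object();
    private int nextChannelId = 0; // per-connection state modified while connecting

    /** Stand-in for channel.connect(): a read-modify-write that is unsafe without the lock. */
    int connectChannel() {
        synchronized (connectLock) {   // only one thread may connect at a time
            int id = nextChannelId;
            Thread.yield();            // widen the window a race would need
            nextChannelId = id + 1;
            return id;
        }
    }
}

/** Hypothetical driver: many threads open channels on the same connection. */
final class ConnectDemo {
    static Set<Integer> connectFromThreads(int threadCount) {
        SharedConnection connection = new SharedConnection();
        Set<Integer> ids = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            pool.execute(() -> {
                try {
                    start.await();     // release all workers together
                    ids.add(connection.connectChannel());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return ids;
    }
}
```

With the lock in place, ConnectDemo.connectFromThreads(8) should hand each thread a distinct channel id. Removing the synchronized block lets two threads read the same counter value, which is the kind of intermittent failure that is hard to reproduce on demand.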

Closing as fixed in ptp_5_0 and HEAD.