| Summary: | [terminal] Regression: Terminal is unusable after "Disconnect" | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Tools] Target Management | Reporter: | Martin Oberhuber <mober.at+eclipse> | ||||||||||
| Component: | Terminal | Assignee: | Martin Oberhuber <mober.at+eclipse> | ||||||||||
| Status: | RESOLVED FIXED | QA Contact: | Martin Oberhuber <mober.at+eclipse> | ||||||||||
| Severity: | critical | ||||||||||||
| Priority: | P1 | CC: | aleherb+eclipse, anna.dushistova, eclipse, pawel.1.piech, uwe.st | ||||||||||
| Version: | 3.3 | Flags: | uwe.st:
review+
anna.dushistova: review+ mober.at+eclipse: review? (dmcknigh) mober.at+eclipse: review? (eclipse) |
||||||||||
| Target Milestone: | 3.3 RC4 | ||||||||||||
| Hardware: | All | ||||||||||||
| OS: | All | ||||||||||||
| Whiteboard: | |||||||||||||
| Bug Depends on: | 333613 | ||||||||||||
| Bug Blocks: | 348709, 348743 | ||||||||||||
| Attachments: |
|
||||||||||||
|
Description
Martin Oberhuber
Created attachment 197578 [details]
patch
Attached patch fixes the issue, by interrupting the reader Thread, rather than closing the PipedInputStream.
This seemed to be the least intrusive change; I chose to keep the changes in PipedInputStream.
For the next release, we may want to look at dealing with the PipedInputStream in a cleaner manner (i.e. close it, and re-create when a new connection is created). For this release, this seemed too risky since the other end of the pipe may be held on to by many clients.
As per our ramp-down plan, I need a review+ for this change from all available committers, in order to get this into 3.3rc4. I have entered bug 348709 to follow up on the current fix early in 3.4 - that bug also has a list of potential problems to check. Created attachment 197593 [details]
patch v2 - fixes disconnect order
Thanks Anna and Uwe for review.
After extensive review, I'm attaching a better patch which also restores the original order before Pawel's change (first close the remote streams, then cancel the job).
I think this is important in order to deal with race conditions and massive amount of data being sent from the remote.
I'm committing this and release into 3.3rc4.
Created attachment 197603 [details]
Patch to stop the connect on interrupt exception
I'd add this change on top of the proposed patch.
1. since we use interrupt, we should thread the interrupt exception as a cancel, therefore I treat an InterruptException in waitForAvailable as cances (this might not be needed but I think it makes the intent clear).
2. With some unlucky timing it is possible to interrupt another job: if our job is just about to finish, we get the thread of the job. Now the job is finished and we still have the thread. But the thread is already running another job. In this case we would interrupt this other job.
My solution is to have (synchronized) variable that holds the thread only as long as our job is running.
(In reply to comment #5) > 2. With some unlucky timing it is possible to interrupt another job: if our job > is just about to finish, we get the thread of the job. Now the job is finished > and we still have the thread. But the thread is already running another job. In > this case we would interrupt this other job. > > My solution is to have (synchronized) variable that holds the thread only as > long as our job is running. In the patch re-setting the fThread variable to null is still outside a sync block? Could also use an AtomicReference. (In reply to comment #6) it is inside a synchronized block. If you apply the patch the code looks like: synchronized (VT100TerminalControl.this) { if (fJob==this) { fJob=null; } // we are finished with the job -- no need to // interrupt it anymore fJobThread=null; } An atomic variable would not help because we have to make sure that we do the interrupt while the job is running. Suppose we would use a volatile or a AtomicReference: 01 Thread thread=(Thread)fAtomicReferenceToJobThread.get(); 02 if(thread!=null) 03 thread.interrupt(); suppose, we are get the thread while the job is running. job is not null. If our thread waits (because the scheduler puts us on wait on line 02) and the job finishes and another job is started then we call the thread.interrupt() after our job has finished and we might interrupt another unrelated job (e.g. another connection). With the synchronized block, we are sure that the job is still running if fJobThread is not null and we call interrupt for the our job: synchronized (this) { // make sure to interrupt the job ONLY if our job is running if(fJobThread!=null) { fJobThread.interrupt(); } } This would be unlucky timing -- but many threading problems are unlucky timing ;-) Michael, I think your scenario is EXTREMELY unlikely to happen - JobManager#endJob, among other things, performs a setThread(null) - and the patch which introduces a SINGLE fJobThread is potentially problematic because while we're still canceling the old Job, a new Job can already be scheduled (we've had problems along these lines with the Serial connector in the past). Also we're running out of time for Indigo, I can't easily re-release. I suggest filing a new bug for follow-up in 3.3.1 to deal with this edge case. I'd assume that the only safe way handling this is refactoring the anonymous "Terminal Data Reader" job into a named class and introducing a cancelNow() method or similar on it. Created attachment 197643 [details]
a patch that avoids race conditions
Martin, I agree, it is extremely unlikely, but that is often the case with threading issues....
You are right, if a new job is created while the old one is canceled then the single fJobThread variable could cause a problem. This is in the even more unlikely case when a connectTerminal is called between the two synchronized blocks in the disconnectTerminal method.
This can only happen, if VT100TerminalControl.connectTerminal is called form another thread than VT100TerminalControl.disconnectTerminal.
Looking at the code, this seems to be a possible!!!
E.g. the job itself can call disconnectTerminal as reaction to an error. In a few other places this can happen as well when exceptions are thrown.
This patch should avoid the problem. There can never be more than one job running at a time.
The new patch interrupts the Thread before setting the canceled status. With bad luck, the Thread receives the interrupt signal while not in waitForAvailable(), and the cancel flag is set while in waitForAvailable -- with the result that job.join() takes 500 msec. I have released my patch v2 into TM I20110608-0935 (3.3RC4) as well as Indigo. I appreciate the discussion, but I think we'll need to follow up in a separate bug for post-Indigo tuning: bug 348709 is already open for this, please move discussions there. |