This Bugzilla instance is deprecated, and most Eclipse projects now use GitHub or Eclipse GitLab. Please see the deprecation plan for details.
Bug 260777 - [ssh] Deadlock when changing selection after multiple hibernate/resume cycles
Summary: [ssh] Deadlock when changing selection after multiple hibernate/resume cycles
Status: RESOLVED FIXED
Alias: None
Product: Target Management
Classification: Tools
Component: RSE (show other bugs)
Version: 3.1   Edit
Hardware: PC Windows XP
: P2 critical (vote)
Target Milestone: 3.1 M5   Edit
Assignee: David McKnight CLA
QA Contact: Martin Oberhuber CLA
URL:
Whiteboard:
Keywords:
: 235494 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-01-12 15:25 EST by Martin Oberhuber CLA
Modified: 2009-08-10 09:52 EDT (History)
0 users

See Also:


Attachments
Thread dump of deadlocked Eclipse (13.63 KB, text/plain)
2009-01-12 15:26 EST, Martin Oberhuber CLA
no flags Details
alternative to synchronizing on the connector service (6.46 KB, patch)
2009-01-13 17:31 EST, David McKnight CLA
no flags Details | Diff
updated patch to use wait/notify in ConnectorServicePool (3.91 KB, patch)
2009-01-21 18:58 EST, David McKnight CLA
no flags Details | Diff
Additional patch to clean up (7.16 KB, text/plain)
2009-08-10 09:52 EDT, Martin Oberhuber CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Oberhuber CLA 2009-01-12 15:25:40 EST
* Have an open SSH Only connection with an expanded Sftp tree.
* Switch to a different perspective.
* Hibernate/Resume the computer few times, changing network connections
  (apparently this disconnects the SSH/Sftp connection).
* Switch back to the RSE perspective, click somewhere in the tree.

--> The whole IDE gets trapped in deadlock with attach thread dump.

This is a critical issue since data in any open editors is lost at this point.
Comment 1 Martin Oberhuber CLA 2009-01-12 15:26:17 EST
Created attachment 122331 [details]
Thread dump of deadlocked Eclipse
Comment 2 Martin Oberhuber CLA 2009-01-12 15:37:15 EST
Analyzing the Thread dump, there are several issues.

1. Worker-14 / SystemFetchOperation.java:264
   Between SystemFetchOperation#run() and SystemFetchOperation#execute(),
   the "synchronized" lock on SshConnectorService must not be held while
   calling out to other code later (with Display.syncExec()). This is the
   recipe for deadlock!

2. main Thread: Locks on Display and SshConnectorService are now required
   in reverse order compared to Worker-14 -- this is causing the deadlock.
   - Why do we run into FileServiceSubSystem.getRemoteFileObject() when
     we should already know that we're disconnected? We shouldn't ever 
     auto-reconnect in this code branch.

This is exactly the situation that I have already described in bug 235494. So, here is the proof that this deadlock can in fact happen. I'd like to see this fixed by M5, since I personally lost some data due to this deadlock.
Comment 3 Martin Oberhuber CLA 2009-01-12 15:38:01 EST
*** Bug 235494 has been marked as a duplicate of this bug. ***
Comment 4 David McKnight CLA 2009-01-13 17:31:16 EST
Created attachment 122467 [details]
alternative to synchronizing on the connector service

Here's an approach that avoids synchronizing on the connector service.  Instead we maintain a static list of connector services which are connecting.  If a connector service is already connecting, then we sleep until it completes before continuing on.  Martin, what are your thoughts on this approach?
Comment 5 Martin Oberhuber CLA 2009-01-21 13:36:16 EST
Hm... the basic approach looks good, but there are a couple of issues:

1.) Whenever you add something to the _connectingConnectorServices, you must
    be 100% sure that you also remove it again in case an exception happens:
    try...
    ...finally (synchronized(_connectingConnectorServices) {
            _connectingConnectorServices.remove(cs);
      }

2.) Instead of Thread.sleep(500), better do
       synchronized(_connectingConnectorServices) {
         _connectingConnectorServices.wait();
       }
    and, when you remove it, you do
       synchronized(_connectingConnectorServices) {
         _connectingConnectorServices.notifyAll();
       }

   In order to avoid some typing work, and the synchornized statements,
   instead of the dumb List you should use a separate class that does all 
   the synchronized/wait stuff for you (implementation would use your List):

   private static class ConnectorServicePool {
     public synchronized void add(IConnectorService cs) {}
     public synchronized void remove(IConnectorService cs) {}
     public synchronized boolean contains(IConnectorService cs) {}
     public synchronized void waitUntilNotContained(IConnectorService cs) {}
   }

3.) After waitUntilNotContained() you still need to check whether it's
    actually connected (because the connect could have been canceled
    due to an exception).
Comment 6 David McKnight CLA 2009-01-21 18:58:31 EST
Created attachment 123309 [details]
updated patch to use wait/notify in ConnectorServicePool

This updated patch uses the suggested ConnectorServicePool class and wait/notify.
Comment 7 David McKnight CLA 2009-01-22 11:43:36 EST
I've committed the change to cvs.
Comment 8 Martin Oberhuber CLA 2009-08-10 09:52:30 EDT
Created attachment 143901 [details]
Additional patch to clean up

Attached additional patch makes the fix safer by putting the try...finally construct right after adding into the COnnectorServicePool, so it is guaranteed that it gets removed under all circumstances.

This is in line with the newer fix for bug 284018 in SubSystem.java.