Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 321032

Summary: [RemoteSvcs][XMPP] Deadlock in ECFConnection.disconnect()
Product: [RT] ECF Reporter: Markus Kuppe <bugs.eclipse.org>
Component: ecf.providersAssignee: Scott Lewis <slewis>
Status: RESOLVED NOT_ECLIPSE QA Contact:
Severity: normal    
Priority: P3 CC: slewis
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
Whiteboard:
Attachments:
Description Flags
screenshot
none
the other thread none

Description Markus Kuppe CLA 2010-07-27 11:21:43 EDT
The build https://build.ecf-project.org/hudson/job/L-HEAD-sdk.tests.feature/203/consoleFull hangs repeatedly due to a deadlock in org.eclipse.ecf.internal.provider.xmpp.smack.ECFConnection.disconnect() and org.jivesoftware.smack.XMPPConnection.shutdown(Presence) subsequently. Unfortunately I can only provide a screenshot as remote debugging the from around the world takes ages.
Comment 1 Markus Kuppe CLA 2010-07-27 11:22:25 EDT
Created attachment 175326 [details]
screenshot
Comment 2 Scott Lewis CLA 2010-07-27 12:50:37 EDT
Could you please attach a screenshot with the stack from the *other* deadlocked thread also?  Thanks.
Comment 3 Markus Kuppe CLA 2010-07-29 08:16:21 EDT
Created attachment 175490 [details]
the other thread
Comment 4 Markus Kuppe CLA 2010-07-29 08:39:19 EDT
Just a hunch, but maybe Mustafa's server does some rate limiting. Thus I'm in the process of changing the XMPP server used for testing.
Comment 5 Scott Lewis CLA 2010-07-29 08:43:42 EDT
(In reply to comment #4)
> Just a hunch, but maybe Mustafa's server does some rate limiting. Thus I'm in
> the process of changing the XMPP server used for testing.

If his server is doing rate limiting, you might want to also just see if he can turn it off.
Comment 6 Markus Kuppe CLA 2010-07-29 10:16:02 EDT
The same behavior happens with a different XMPP server instance (AFAIC both ejabberd).
Comment 7 Markus Kuppe CLA 2010-07-29 10:53:42 EDT
Build https://build.ecf-project.org/hudson/job/L-HEAD-sdk.tests.feature/ has been disabled until this issue is addressed.
Comment 8 Scott Lewis CLA 2010-07-29 13:05:09 EDT
In attachment 175490 [details] (the second thread), the output doesn't indicate what lock the Smack Packet reader thread is waiting for (the first thread posting/attachment 175326 [details] does indicate that it owns several locks and is waiting for the InputStreamReader).  Shouldn't the 'waiting for' info be in the output for this second thread as well?  

Also...how frequently/consistently does this deadlock occur in the test environment?  Currently, I'm not able to reproduce it locally, but if it occurs fairly regularly in some testing environment then perhaps that environment can be used.
Comment 9 Markus Kuppe CLA 2010-07-29 13:23:36 EDT
(In reply to comment #8)
> In attachment 175490 [details] (the second thread), the output doesn't indicate what lock
> the Smack Packet reader thread is waiting for (the first thread
> posting/attachment 175326 [details] does indicate that it owns several locks and is
> waiting for the InputStreamReader).  Shouldn't the 'waiting for' info be in the
> output for this second thread as well?  

Might be not a deadlock as per academic definition. ECFConnection.disconnect() waits for a lock held by "Smart packet reader" that never is released.

> Also...how frequently/consistently does this deadlock occur in the test
> environment?  Currently, I'm not able to reproduce it locally, but if it occurs
> fairly regularly in some testing environment then perhaps that environment can
> be used.

To me it looks like it happens fairly regular on build.eclipse.org. If you want to remote debug it, just enable and start L-HEAD-sdk.tests.feature. Past the compile step, the unit tests will open a remote debug port (just scan the console for e.g. "Listening for transport dt_socket at address: 25127") and attach to it via ssh (ssh -L 25127t:ecf2.osuosl.org:25127 ecf2.osuosl.org). When you start the build you can choose to have the VM wait for your connection before executing main.
Comment 10 Scott Lewis CLA 2010-07-29 13:44:22 EDT
(In reply to comment #9)
> (In reply to comment #8)
> > In attachment 175490 [details] [details] (the second thread), the output doesn't indicate what lock
> > the Smack Packet reader thread is waiting for (the first thread
> > posting/attachment 175326 [details] [details] does indicate that it owns several locks and is
> > waiting for the InputStreamReader).  Shouldn't the 'waiting for' info be in the
> > output for this second thread as well?  
> 
> Might be not a deadlock as per academic definition. ECFConnection.disconnect()
> waits for a lock held by "Smart packet reader" that never is released.

Hmmm.  So the first thread is trying to get the Smack packet reader thread (in order to call XMPPConnection.disconnect()), but can't because the Smack Packet Reader is blocked on SocketInputStream.socketRead0...which is indeed a blocking I/O call.

The second thread (Smack Packet Reader) doesn't have any of our code on the stack (it's all Smack code) and so doesn't have any of the same 3 locks that the first thread has (i.e. ECFConnection, ClientSOContainer$Lock, SOContainerGMM).

If these are indeed the only two threads involved in the deadlock, then it seems to me to be a Smack bug...as it's all code internal to Smack...apparently caused by calling XMPPConnection.disconnect at an inopportune time.  If that's right, I don't know how to prevent it without debugging/fixing Smack code (which would be the next step).  We *should* be able to call XMPPConnection.disconnect() at any time without deadlocking.

But before looking into the Smack code more deeply...are you certain there isn't another thread that has the SocketInputStream lock and is waiting for one of the three locks owned by the first thread?

> 
> > Also...how frequently/consistently does this deadlock occur in the test
> > environment?  Currently, I'm not able to reproduce it locally, but if it occurs
> > fairly regularly in some testing environment then perhaps that environment can
> > be used.
> 
> To me it looks like it happens fairly regular on build.eclipse.org. If you want
> to remote debug it, just enable and start L-HEAD-sdk.tests.feature. Past the
> compile step, the unit tests will open a remote debug port (just scan the
> console for e.g. "Listening for transport dt_socket at address: 25127") and
> attach to it via ssh (ssh -L 25127t:ecf2.osuosl.org:25127 ecf2.osuosl.org).
> When you start the build you can choose to have the VM wait for your connection
> before executing main.


How do I choose to have the VM wait for connection upon build start?
Comment 11 Markus Kuppe CLA 2010-07-29 14:03:23 EDT
(In reply to comment #10)
> How do I choose to have the VM wait for connection upon build start?

Once you hit build it takes you to a second page where it offers a drop down box that allows to set the build to suspend.
Comment 12 Markus Kuppe CLA 2010-07-29 14:07:56 EDT
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > > In attachment 175490 [details] [details] [details] (the second thread), the output doesn't indicate what lock
> > > the Smack Packet reader thread is waiting for (the first thread
> > > posting/attachment 175326 [details] [details] [details] does indicate that it owns several locks and is
> > > waiting for the InputStreamReader).  Shouldn't the 'waiting for' info be in the
> > > output for this second thread as well?  
> > 
> > Might be not a deadlock as per academic definition. ECFConnection.disconnect()
> > waits for a lock held by "Smart packet reader" that never is released.
> 
> Hmmm.  So the first thread is trying to get the Smack packet reader thread (in
> order to call XMPPConnection.disconnect()), but can't because the Smack Packet
> Reader is blocked on SocketInputStream.socketRead0...which is indeed a blocking
> I/O call.
> 
> The second thread (Smack Packet Reader) doesn't have any of our code on the
> stack (it's all Smack code) and so doesn't have any of the same 3 locks that
> the first thread has (i.e. ECFConnection, ClientSOContainer$Lock,
> SOContainerGMM).
> 
> If these are indeed the only two threads involved in the deadlock, then it
> seems to me to be a Smack bug...as it's all code internal to Smack...apparently
> caused by calling XMPPConnection.disconnect at an inopportune time.  If that's
> right, I don't know how to prevent it without debugging/fixing Smack code
> (which would be the next step).  We *should* be able to call
> XMPPConnection.disconnect() at any time without deadlocking.

Maybe our use of calling disconnect from within a disconnect listener is the problem. XMPPS.disconnect() > ECFConnection.disconnect() > XMPP.disconnect.

> But before looking into the Smack code more deeply...are you certain there
> isn't another thread that has the SocketInputStream lock and is waiting for one
> of the three locks owned by the first thread?

Due to the latency it takes a few minutes for me to even get the list of threads. Would be great if you could remote connect to the box.
Comment 13 Scott Lewis CLA 2010-07-29 14:25:40 EDT
(In reply to comment #12)
<stuff deleted>
> > right, I don't know how to prevent it without debugging/fixing Smack code
> > (which would be the next step).  We *should* be able to call
> > XMPPConnection.disconnect() at any time without deadlocking.
> 
> Maybe our use of calling disconnect from within a disconnect listener is the
> problem. XMPPS.disconnect() > ECFConnection.disconnect() > XMPP.disconnect.

It's the main thread that calls XMPPS.disconnect() > ... > XMPPConnection.disconnect()...not a Smack internal thread...so I don't think this is the issue.

I took a quick look at the Smack disconnect() code, and it doesn't seem right to me...i.e. here's the shutdown method:

    protected void shutdown(Presence unavailablePresence) {
        // Set presence to offline.
        packetWriter.sendPacket(unavailablePresence);

        this.setWasAuthenticated(authenticated);
        authenticated = false;
        connected = false;

        packetReader.shutdown();
        packetWriter.shutdown();
        // Wait 150 ms for processes to clean-up, then shutdown.
        try {
            Thread.sleep(150);   <<< SBL...this seems not at all good 
        }
        catch (Exception e) {
            // Ignore.
        }

        // Close down the readers and writers.
        if (reader != null) {
            try {
                reader.close();       <<< here is where it hangs
            }
            catch (Throwable ignore) { /* ignore */ }
            reader = null;
        }
        if (writer != null) {
            try {
                writer.close();
            }
            catch (Throwable ignore) { /* ignore */ }
            writer = null;
        }

        try {
            socket.close();
        }
        catch (Exception e) {
            // Ignore.
        }

        saslAuthentication.init();
    }


The above doesn't seem right to me...as it sends a unavailablePresence packet and waits 150ms before calling close on the reader, writer, and *then* the socket.  

> 
> > But before looking into the Smack code more deeply...are you certain there
> > isn't another thread that has the SocketInputStream lock and is waiting for one
> > of the three locks owned by the first thread?
> 
> Due to the latency it takes a few minutes for me to even get the list of
> threads. Would be great if you could remote connect to the box.


I may not have time immediately/today, but will try to get to it soon.  Please remind me if it drops off.
Comment 14 Scott Lewis CLA 2014-02-12 15:34:13 EST
This seems to have been fixed with Smack version upgrade.  If we experience it further, please reopen.