Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 337973 - [zoodiscovery] Remote service undiscovered and not rediscovered due to concurrency bug
Summary: [zoodiscovery] Remote service undiscovered and not rediscovered due to concur...
Status: RESOLVED FIXED
Alias: None
Product: ECF
Classification: RT
Component: ecf.providers (show other bugs)
Version: 3.5.0   Edit
Hardware: PC Windows XP
: P3 major (vote)
Target Milestone: 3.5.0   Edit
Assignee: ecf.core-inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-23 09:09 EST by Franky Bridelance CLA
Modified: 2011-03-14 04:37 EDT (History)
4 users (show)

See Also:


Attachments
Proposed swapping class (5.82 KB, application/octet-stream)
2011-02-28 11:36 EST, Ahmed Aadel CLA
no flags Details
mylyn/context/zip (1.07 KB, application/octet-stream)
2011-03-08 05:21 EST, Wim Jongman CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Franky Bridelance CLA 2011-02-23 09:09:05 EST
Build Identifier: 3.4

Hi, 

While using zoodiscovery I got a remote service undiscovered and it was not rediscovered again. Following error has been reported at the moment of undiscovery:

22-02-2011 20:47:04 [pool-1-thread-5] [org.apache.zookeeper.server.NIOServerCnxn] ERROR - Thread Thread[pool-1-thread-5,5,main] died
org.eclipse.core.runtime.AssertionFailedException: null argument:
            at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:85)
            at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:73)
            at org.eclipse.ecf.provider.zookeeper.node.internal.WatchManager.addZooKeeper(WatchManager.java:238)
            at org.eclipse.ecf.provider.zookeeper.node.internal.ReadRoot$1.run(ReadRoot.java:70)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.lang.Thread.run(Unknown Source)

I had a quick look at the zoodiscovery code and it seems that the above exception happened because ReadRoot.this.readKeeper is null on line 70 in ReadRoot.java. I think the reason why readKeeper is null is because there's a race condition in the ReadRoot.process method. Although the process method is synchronized it uses a separate thread to process the WatchedEvent and the code within this thread is not thread safe.
For example, you get two ReadRoot.process calls quite fast after each other:
- one with WatchedEvent state Disconnected or Expired (A)
- one with WatchedEvent state SyncConnected (B)
Then you can have following:
                          enter ReadRoot.process(A)
                         / 
     start thread for A -
              |           exit ReadRoot.process(A)
              |           enter ReadRoot.process(B)
     enter case Disconnected                       \
     ReadRoot.this.isConnected=false                \
     enter connect()                                 - start thread for B
     ...                                                        |
     this.readKeeper.close()                                    |
              |                                  enter case SyncConnected
              |                                  if (!ReadRoot.this.isConnected)
     this.readKeeper = null                                     |
              |                                   ReadRoot.this.isConnected=true
     watchManager.removeZooKeeper(null) ????                    |
              |                                  watchManager.addZooKeeper(null) --> !!! AssertionFailedException !!!
                                                                   
???? --> That's another bug: in connect method removeZooKeeper will have no effect because this.readKeeper is already set to null. This means that the list of zooKeepers in watchManager will only grow. 

I've put the severity on Major because the remote service is not rediscovered.

Reproducible: Sometimes

Steps to Reproduce:
As for any concurrency bug it's hard to reproduce and I don't have any idea how I could increase the probability to reproduce it.
Comment 1 Scott Lewis CLA 2011-02-23 10:29:51 EST
Adding Markus, Wim, and Ahmed for comment.
Comment 2 Franky Bridelance CLA 2011-02-23 10:39:50 EST
Hmm, unfortunately my nice ascii diagram got messed up... I'll try to add it once more:

                     enter ReadRoot.process(A)
                    / 
start thread for A -
         |           exit ReadRoot.process(A)
         |           enter ReadRoot.process(B)
enter case Disconnected                       \
ReadRoot.this.isConnected=false                \
enter connect()                                 - start thread for B
...                                                        |
this.readKeeper.close()                                    |
         |                                  enter case SyncConnected
         |                                if (!ReadRoot.this.isConnected)
this.readKeeper = null                                     |
         |                                ReadRoot.this.isConnected=true
watchManager.removeZooKeeper(null) ????                    |
         |                              watchManager.addZooKeeper(null) 
        ...                            --> !!! AssertionFailedException !!!


(In reply to comment #0)
> Build Identifier: 3.4
> Hi, 
> While using zoodiscovery I got a remote service undiscovered and it was not
> rediscovered again. Following error has been reported at the moment of
> undiscovery:
> 22-02-2011 20:47:04 [pool-1-thread-5]
> [org.apache.zookeeper.server.NIOServerCnxn] ERROR - Thread
> Thread[pool-1-thread-5,5,main] died
> org.eclipse.core.runtime.AssertionFailedException: null argument:
>             at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:85)
>             at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:73)
>             at
> org.eclipse.ecf.provider.zookeeper.node.internal.WatchManager.addZooKeeper(WatchManager.java:238)
>             at
> org.eclipse.ecf.provider.zookeeper.node.internal.ReadRoot$1.run(ReadRoot.java:70)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>             at java.lang.Thread.run(Unknown Source)
> I had a quick look at the zoodiscovery code and it seems that the above
> exception happened because ReadRoot.this.readKeeper is null on line 70 in
> ReadRoot.java. I think the reason why readKeeper is null is because there's a
> race condition in the ReadRoot.process method. Although the process method is
> synchronized it uses a separate thread to process the WatchedEvent and the code
> within this thread is not thread safe.
> For example, you get two ReadRoot.process calls quite fast after each other:
> - one with WatchedEvent state Disconnected or Expired (A)
> - one with WatchedEvent state SyncConnected (B)
> Then you can have following:
>                           enter ReadRoot.process(A)
>                          / 
>      start thread for A -
>               |           exit ReadRoot.process(A)
>               |           enter ReadRoot.process(B)
>      enter case Disconnected                       \
>      ReadRoot.this.isConnected=false                \
>      enter connect()                                 - start thread for B
>      ...                                                        |
>      this.readKeeper.close()                                    |
>               |                                  enter case SyncConnected
>               |                                  if
> (!ReadRoot.this.isConnected)
>      this.readKeeper = null                                     |
>               |                                  
> ReadRoot.this.isConnected=true
>      watchManager.removeZooKeeper(null) ????                    |
>               |                                 
> watchManager.addZooKeeper(null) --> !!! AssertionFailedException !!!
> ???? --> That's another bug: in connect method removeZooKeeper will have no
> effect because this.readKeeper is already set to null. This means that the list
> of zooKeepers in watchManager will only grow. 
> I've put the severity on Major because the remote service is not rediscovered.
> Reproducible: Sometimes
> Steps to Reproduce:
> As for any concurrency bug it's hard to reproduce and I don't have any idea how
> I could increase the probability to reproduce it.
Comment 3 Ahmed Aadel CLA 2011-02-28 11:36:49 EST
Created attachment 189967 [details]
Proposed swapping class

Hi Franky,

Here is, attached, an altered version of class ReadRoot.java. If you're using zooDiscovery source in your workspace, you can just paste this class there instead of the old one and give it a try.
If this fixes the current issue, we'll commit to master, otherwise please do post traces and we'll handle it further.

Thanks.
Comment 4 Scott Lewis CLA 2011-03-01 14:25:23 EST
I would like to include the resolution of this bug in ECF 3.5, scheduled for release 3/14/2011.  Is this reasonable?  Would it be ok with people if I set the target milestone appropriately?

If this is not feasible then that's fine...I'll set the target milestone to ECF 3.6.
Comment 5 Wim Jongman CLA 2011-03-01 16:06:14 EST
3.5 is good.
Comment 6 Scott Lewis CLA 2011-03-01 17:01:30 EST
Setting target milestone to 3.5 as per comment 5.
Comment 7 Wim Jongman CLA 2011-03-08 05:21:58 EST
Created attachment 190640 [details]
mylyn/context/zip

Applied patch to master. Let the build run the unit tests.
Comment 8 Wim Jongman CLA 2011-03-08 05:40:59 EST
Franky, thanks for the great bug report. You can pick up the latest 3.5 build here 

https://build.ecf-project.org/jenkins/job/C-HEAD-sdk.feature/lastSuccessfulBuild/artifact/site.p2/

Please give it a spin.

Closing.
Comment 9 Franky Bridelance CLA 2011-03-14 04:37:10 EDT
(In reply to comment #8)
> Franky, thanks for the great bug report. You can pick up the latest 3.5 build
> here 
> https://build.ecf-project.org/jenkins/job/C-HEAD-sdk.feature/lastSuccessfulBuild/artifact/site.p2/
> Please give it a spin.
> Closing.

Sorry for the late reply. 
As I couldn't reproduce the problem I won't be able to say with certainty if the fix solves the problem but the fix rules out the concurrency issue so it should be fine. I'll update to 3.5 this week and give it a try. 

Thanks for the quick bug fix.

Franky