Community
Participate
Working Groups
Build Identifier: 3.4 Hi, While using zoodiscovery I got a remote service undiscovered and it was not rediscovered again. Following error has been reported at the moment of undiscovery: 22-02-2011 20:47:04 [pool-1-thread-5] [org.apache.zookeeper.server.NIOServerCnxn] ERROR - Thread Thread[pool-1-thread-5,5,main] died org.eclipse.core.runtime.AssertionFailedException: null argument: at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:85) at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:73) at org.eclipse.ecf.provider.zookeeper.node.internal.WatchManager.addZooKeeper(WatchManager.java:238) at org.eclipse.ecf.provider.zookeeper.node.internal.ReadRoot$1.run(ReadRoot.java:70) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) I had a quick look at the zoodiscovery code and it seems that the above exception happened because ReadRoot.this.readKeeper is null on line 70 in ReadRoot.java. I think the reason why readKeeper is null is because there's a race condition in the ReadRoot.process method. Although the process method is synchronized it uses a separate thread to process the WatchedEvent and the code within this thread is not thread safe. For example, you get two ReadRoot.process calls quite fast after each other: - one with WatchedEvent state Disconnected or Expired (A) - one with WatchedEvent state SyncConnected (B) Then you can have following: enter ReadRoot.process(A) / start thread for A - | exit ReadRoot.process(A) | enter ReadRoot.process(B) enter case Disconnected \ ReadRoot.this.isConnected=false \ enter connect() - start thread for B ... | this.readKeeper.close() | | enter case SyncConnected | if (!ReadRoot.this.isConnected) this.readKeeper = null | | ReadRoot.this.isConnected=true watchManager.removeZooKeeper(null) ???? | | watchManager.addZooKeeper(null) --> !!! AssertionFailedException !!! ???? --> That's another bug: in connect method removeZooKeeper will have no effect because this.readKeeper is already set to null. This means that the list of zooKeepers in watchManager will only grow. I've put the severity on Major because the remote service is not rediscovered. Reproducible: Sometimes Steps to Reproduce: As for any concurrency bug it's hard to reproduce and I don't have any idea how I could increase the probability to reproduce it.
Adding Markus, Wim, and Ahmed for comment.
Hmm, unfortunately my nice ascii diagram got messed up... I'll try to add it once more: enter ReadRoot.process(A) / start thread for A - | exit ReadRoot.process(A) | enter ReadRoot.process(B) enter case Disconnected \ ReadRoot.this.isConnected=false \ enter connect() - start thread for B ... | this.readKeeper.close() | | enter case SyncConnected | if (!ReadRoot.this.isConnected) this.readKeeper = null | | ReadRoot.this.isConnected=true watchManager.removeZooKeeper(null) ???? | | watchManager.addZooKeeper(null) ... --> !!! AssertionFailedException !!! (In reply to comment #0) > Build Identifier: 3.4 > Hi, > While using zoodiscovery I got a remote service undiscovered and it was not > rediscovered again. Following error has been reported at the moment of > undiscovery: > 22-02-2011 20:47:04 [pool-1-thread-5] > [org.apache.zookeeper.server.NIOServerCnxn] ERROR - Thread > Thread[pool-1-thread-5,5,main] died > org.eclipse.core.runtime.AssertionFailedException: null argument: > at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:85) > at org.eclipse.core.runtime.Assert.isNotNull(Assert.java:73) > at > org.eclipse.ecf.provider.zookeeper.node.internal.WatchManager.addZooKeeper(WatchManager.java:238) > at > org.eclipse.ecf.provider.zookeeper.node.internal.ReadRoot$1.run(ReadRoot.java:70) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown > Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.lang.Thread.run(Unknown Source) > I had a quick look at the zoodiscovery code and it seems that the above > exception happened because ReadRoot.this.readKeeper is null on line 70 in > ReadRoot.java. I think the reason why readKeeper is null is because there's a > race condition in the ReadRoot.process method. Although the process method is > synchronized it uses a separate thread to process the WatchedEvent and the code > within this thread is not thread safe. > For example, you get two ReadRoot.process calls quite fast after each other: > - one with WatchedEvent state Disconnected or Expired (A) > - one with WatchedEvent state SyncConnected (B) > Then you can have following: > enter ReadRoot.process(A) > / > start thread for A - > | exit ReadRoot.process(A) > | enter ReadRoot.process(B) > enter case Disconnected \ > ReadRoot.this.isConnected=false \ > enter connect() - start thread for B > ... | > this.readKeeper.close() | > | enter case SyncConnected > | if > (!ReadRoot.this.isConnected) > this.readKeeper = null | > | > ReadRoot.this.isConnected=true > watchManager.removeZooKeeper(null) ???? | > | > watchManager.addZooKeeper(null) --> !!! AssertionFailedException !!! > ???? --> That's another bug: in connect method removeZooKeeper will have no > effect because this.readKeeper is already set to null. This means that the list > of zooKeepers in watchManager will only grow. > I've put the severity on Major because the remote service is not rediscovered. > Reproducible: Sometimes > Steps to Reproduce: > As for any concurrency bug it's hard to reproduce and I don't have any idea how > I could increase the probability to reproduce it.
Created attachment 189967 [details] Proposed swapping class Hi Franky, Here is, attached, an altered version of class ReadRoot.java. If you're using zooDiscovery source in your workspace, you can just paste this class there instead of the old one and give it a try. If this fixes the current issue, we'll commit to master, otherwise please do post traces and we'll handle it further. Thanks.
I would like to include the resolution of this bug in ECF 3.5, scheduled for release 3/14/2011. Is this reasonable? Would it be ok with people if I set the target milestone appropriately? If this is not feasible then that's fine...I'll set the target milestone to ECF 3.6.
3.5 is good.
Setting target milestone to 3.5 as per comment 5.
Created attachment 190640 [details] mylyn/context/zip Applied patch to master. Let the build run the unit tests.
Franky, thanks for the great bug report. You can pick up the latest 3.5 build here https://build.ecf-project.org/jenkins/job/C-HEAD-sdk.feature/lastSuccessfulBuild/artifact/site.p2/ Please give it a spin. Closing.
(In reply to comment #8) > Franky, thanks for the great bug report. You can pick up the latest 3.5 build > here > https://build.ecf-project.org/jenkins/job/C-HEAD-sdk.feature/lastSuccessfulBuild/artifact/site.p2/ > Please give it a spin. > Closing. Sorry for the late reply. As I couldn't reproduce the problem I won't be able to say with certainty if the fix solves the problem but the fix rules out the concurrency issue so it should be fine. I'll update to 3.5 this week and give it a try. Thanks for the quick bug fix. Franky