Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 362278

Summary: [zoodiscovery] Missing remote service during republish after a zookeeper session timeout
Product: [RT] ECF Reporter: Franky Bridelance <franky.bugzilla>
Component: ecf.providersAssignee: Wim Jongman <wim.jongman>
Status: CLOSED FIXED QA Contact:
Severity: critical    
Priority: P3 CC: ahmed.aadel, bugs.eclipse.org, slewis, wim.jongman
Version: unspecified   
Target Milestone: 3.5.3   
Hardware: PC   
OS: Windows 7   
Whiteboard:
Attachments:
Description Flags
File containing fixes for the mentioned issues none

Description Franky Bridelance CLA 2011-10-28 04:16:42 EDT
Build Identifier: ECF 3.5.2

In zoodiscovery's WriteRoot class, when zookeeper has a session timeout a "Disconnected" event is sent to WriteRoot. As consequence of this, WriteRoot starts a reconnect and once "SyncConnected" event is received WriteRoot republishes all known remote services. What I see is that sometimes during the republish of the first remote service (in NodeWrite.publish) a KeeperException is thrown with code SESSIONEXPIRED (by ZooKeeper.exists method). This results that the first remote service is never republished (hence the critical severity). Btw this was hard to find because the "catch (KeeperException e)" in NodeWrite.publish method only logs the exception in case the KeeperException code is CONNECTIONLOSS...
I had a look at WriteRoot implementation and IMHO there are several small problems which could explain above issue:
-  Event processing in WriteRoot is done synchronously which means that the reconnect is done in the "dying" EventThread of zookeeper. I see that in ReadRoot event processing is done asynchronously (+ synchronization block), so I guess that in WriteRoot it should also be done asynchronously (+ synchronization block).
- In "Disconnected" and "Expired" event processing, isConnected is first set to false and then unpublishAll is called on watchManager which will call remove method on every NodeWriter. In this NodeWriter remove method first thing checked is if isConnected of WriteRoot is true to unpublish the remote service but it's always false... So no unpublish happens
- In WriteRoot.connect method there is following code: "this.writeKeeper = null; 				this.watchManager.removeZooKeeper(this.writeKeeper);". Because writeKeeper is set to null before the removeZooKeeper call the old writeKeeper is never removed.
- In WriteRoot.connect method a new ZooKeeper instance is created and assigned to writeKeeper member but it never gets added to the watchManager (like done in initWriteKeeper). In fact it's best to remove it from the initWriteKeeper and add it in the "SyncConnected" event processing (similar to ReadRoot).

I'll attach a modified WriteRoot file containing fixes for all above issues. When using this fixed WriteRoot class I cannot reproduce the missing remote services anymore.

One more observation (not part of the fixed WriteRoot): Maybe there's more behind it but IMHO using a cached thread pool to process events asynchronously (like done in ReadRoot and I took over in WriteRoot) is strange because during tests I saw a lot of events incoming which results in a lot of threads (30+) created to process those events but then they all need to wait because of the synchronized block (which is needed, believe me ;-)). I think a better pattern to process those events asynchronously is to have a queue and one worker thread reading out of the queue --> no big number of threads in case of a lot of incoming events and no synchronization block needed as each event is processed sequentially. That's just a thought...


Reproducible: Sometimes
Comment 1 Franky Bridelance CLA 2011-10-28 04:18:05 EDT
Created attachment 206116 [details]
File containing fixes for the mentioned issues
Comment 2 Scott Lewis CLA 2011-10-28 09:46:18 EDT
Adding Wim, Ahmed, and Marcus for examination/
Comment 3 Wim Jongman CLA 2011-10-30 16:12:05 EDT
Hi Franky,

thanks a lot for the patch. However, before I can accept it I have a few questions.

> severity). Btw this was hard to find because the "catch (KeeperException e)" in
> NodeWrite.publish method only logs the exception in case the KeeperException
> code is CONNECTIONLOSS...

Yes. Nice catch. Thanks for finding that.

> I had a look at WriteRoot implementation and IMHO there are several small
> problems which could explain above issue:
> -  Event processing in WriteRoot is done synchronously which means that the
> reconnect is done in the "dying" EventThread of zookeeper. I see that in
> ReadRoot event processing is done asynchronously (+ synchronization block), so
> I guess that in WriteRoot it should also be done asynchronously (+
> synchronization block).

Can you elaborate on the basis of that guess? Does this mean that this has nothing to do with the bug? The reason I ask is that we want to change as little as possible to fix this. 

> - In "Disconnected" and "Expired" event processing, isConnected is first set to
> false and then unpublishAll is called on watchManager which will call remove
> method on every NodeWriter. In this NodeWriter remove method first thing
> checked is if isConnected of WriteRoot is true to unpublish the remote service
> but it's always false... So no unpublish happens

Yes, that sounds logical. However, unpublishing happens AFAIK. We have to dig thru this some more.

> - In WriteRoot.connect method there is following code: "this.writeKeeper =
> null;                 this.watchManager.removeZooKeeper(this.writeKeeper);".
> Because writeKeeper is set to null before the removeZooKeeper call the old
> writeKeeper is never removed.

Yes the call with null does not make sense. Good catch. 


> - In WriteRoot.connect method a new ZooKeeper instance is created and assigned
> to writeKeeper member but it never gets added to the watchManager (like done in
> initWriteKeeper). In fact it's best to remove it from the initWriteKeeper and
> add it in the "SyncConnected" event processing (similar to ReadRoot).

I'm not sure if this is really needed. We need to investigate.

> 
> I'll attach a modified WriteRoot file containing fixes for all above issues.
> When using this fixed WriteRoot class I cannot reproduce the missing remote
> services anymore.

Thanks.

> 
> One more observation (not part of the fixed WriteRoot): Maybe there's more
> behind it but IMHO using a cached thread pool to process events asynchronously
> (like done in ReadRoot and I took over in WriteRoot) is strange because during
> tests I saw a lot of events incoming which results in a lot of threads (30+)
> created to process those events but then they all need to wait because of the
> synchronized block (which is needed, believe me ;-)). 

I want to believe but I am not religious ;). You mean the sync block you added or the sync block in ReadRoot? Can you elaborate? 


We try to fix this for 3.5.3 which is due tomorrow. It will probably not make it..
Comment 4 Franky Bridelance CLA 2011-11-07 04:01:46 EST
Hi Wim,

Sorry for my late reply, I was on holiday and had limited email access. See replies inline.

(In reply to comment #3)
> Hi Franky,
> 
> thanks a lot for the patch. However, before I can accept it I have a few
> questions.
> 
> > severity). Btw this was hard to find because the "catch (KeeperException e)" in
> > NodeWrite.publish method only logs the exception in case the KeeperException
> > code is CONNECTIONLOSS...
> 
> Yes. Nice catch. Thanks for finding that.
> 
> > I had a look at WriteRoot implementation and IMHO there are several small
> > problems which could explain above issue:
> > -  Event processing in WriteRoot is done synchronously which means that the
> > reconnect is done in the "dying" EventThread of zookeeper. I see that in
> > ReadRoot event processing is done asynchronously (+ synchronization block), so
> > I guess that in WriteRoot it should also be done asynchronously (+
> > synchronization block).
> 
> Can you elaborate on the basis of that guess? Does this mean that this has
> nothing to do with the bug? The reason I ask is that we want to change as
> little as possible to fix this. 

I have no hard proof  that the synchronous event handling is the root cause of the problem but by making the event handling asynchronous in WriteRoot I could not reproduce the problem anymore. So I guess it's the root cause. Btw to reproduce the problem, I changed the system time a number of seconds in the future (can happen when synched to an NTP server) to trigger a zookeeper session timeout. 

> 
> > - In "Disconnected" and "Expired" event processing, isConnected is first set to
> > false and then unpublishAll is called on watchManager which will call remove
> > method on every NodeWriter. In this NodeWriter remove method first thing
> > checked is if isConnected of WriteRoot is true to unpublish the remote service
> > but it's always false... So no unpublish happens
> 
> Yes, that sounds logical. However, unpublishing happens AFAIK. We have to dig
> thru this some more.
> 
> > - In WriteRoot.connect method there is following code: "this.writeKeeper =
> > null;                 this.watchManager.removeZooKeeper(this.writeKeeper);".
> > Because writeKeeper is set to null before the removeZooKeeper call the old
> > writeKeeper is never removed.
> 
> Yes the call with null does not make sense. Good catch. 
> 
> 
> > - In WriteRoot.connect method a new ZooKeeper instance is created and assigned
> > to writeKeeper member but it never gets added to the watchManager (like done in
> > initWriteKeeper). In fact it's best to remove it from the initWriteKeeper and
> > add it in the "SyncConnected" event processing (similar to ReadRoot).
> 
> I'm not sure if this is really needed. We need to investigate.
> 
> > 
> > I'll attach a modified WriteRoot file containing fixes for all above issues.
> > When using this fixed WriteRoot class I cannot reproduce the missing remote
> > services anymore.
> 
> Thanks.
> 
> > 
> > One more observation (not part of the fixed WriteRoot): Maybe there's more
> > behind it but IMHO using a cached thread pool to process events asynchronously
> > (like done in ReadRoot and I took over in WriteRoot) is strange because during
> > tests I saw a lot of events incoming which results in a lot of threads (30+)
> > created to process those events but then they all need to wait because of the
> > synchronized block (which is needed, believe me ;-)). 
> 
> I want to believe but I am not religious ;). You mean the sync block you added
> or the sync block in ReadRoot? Can you elaborate? 

Both sync block in ReadRoot and WriteRoot are needed if events are processed asynchronously with a cached thread pool. The "believe me" was just referring to the fact that I first changed WriteRoot to process the events asynchronously but without the sync block and I got tens of zookeeper eventThread and sendThread instances. 

> 
> 
> We try to fix this for 3.5.3 which is due tomorrow. It will probably not make
> it..
Comment 5 Ahmed Aadel CLA 2011-11-08 09:33:34 EST
Hi Franky,

>What I see is that sometimes during the
>republish of the first remote service (in NodeWrite.publish) a KeeperException
>is thrown with code SESSIONEXPIRED (by ZooKeeper.exists method). This results
>that the first remote service is never republished (hence the critical
>severity). Btw this was hard to find because the "catch (KeeperException e)" in
>NodeWrite.publish method only logs the exception in case the KeeperException
>code is CONNECTIONLOSS...
 
That method's semantics was meant so, and as you can see nothing (except from logging) is done in catch clauses, as it's not 
the task of a NodeWriter instance to recover from connectionless and session timeouts cases by itself. Actually, it just cannot.
That task is dealt with centrally in WriteRoot where, anyway, connection-loss and timeouts are detected and dealt with by attempting to republish all (at that time known) services, thus inclusive the ones that failed middle-way.

If you're getting timeouts, then it's more likely related to the way we create zookeeper clients. Currently we build a writing zooKeeper client with a sessionTimeout of 3000ms as in code:
 
this.writeKeeper = new ZooKeeper(this.ip, 3000 <--sessionTimeout , this);
in WriteRoot class.

As a first attempt helping this bug away (which is a very valuable bug case due its reproduction rarity), you'd rather, please, try widening that session time-out in those places. If that solves the timing out problem, then we can externalize that sessionTimeout setting somehow to give users more control.


>Event processing in WriteRoot is done synchronously which means that the
>reconnect is done in the "dying" EventThread of zookeeper. I see that in
>ReadRoot event processing is done asynchronously (+ synchronization block), so
>I guess that in WriteRoot it should also be done asynchronously (+
>synchronization block).

The whole WriteRoot object runs on separate thread. This class deals primarily with ZooKeeper lifecycle call backs in contrast to node monitoring done by readers which is asyn on call basis. Yet, if wrapping that method's body
in a separte thread similar to its counterpart in ReadRoot helps, I'm OK with it.

>- In "Disconnected" and "Expired" event processing, isConnected is first set to
>false and then unpublishAll is called on watchManager which will call remove
>method on every NodeWriter. In this NodeWriter remove method first thing
>checked is if isConnected of WriteRoot is true to unpublish the remote service
>but it's always false... So no unpublish happens

The unpublishing action has 2 phases: First, removing the service instances from the collection of services known to the local ZooDiscovery, then
deleting each service's node representation from the zooKeeper server. Hence, the second is only relevant for explicit service unregistration calls (e.g. ZooDiscoveryContainer.unregisterService(),unregisterService(IServiceInfo)) PROVIDED we're still connected, of course.
In the other 2 cases, "Disconnected" and "Expired", the second phase is neither needed nor possible and the being-connected-to server will automatically 
detect our fall away leading it to delete the services/nodes our client registered previously.

>- In WriteRoot.connect method there is following code: "this.writeKeeper =
>null; this.watchManager.removeZooKeeper(this.writeKeeper);".
>Because writeKeeper is set to null before the removeZooKeeper call the old
>writeKeeper is never removed.

Correct. Should be similar to its counterpart in ReadRoot

>- In WriteRoot.connect method a new ZooKeeper instance is created and assigned
>to writeKeeper member but it never gets added to the watchManager (like done in
>initWriteKeeper). In fact it's best to remove it from the initWriteKeeper and
>add it in the "SyncConnected" event processing (similar to ReadRoot).

Agree.

Thanks
Comment 6 Scott Lewis CLA 2011-11-08 12:11:02 EST
Hi Ahmed,

ECF 3.5.3 is going to be released soon (as soon as our builder security issues get worked out)...are there any pending changes associated with this bug that should delay the release (as it would be nice to get them in immediately)...or should the build/release proceed as things are?

Thanks,

Scott
Comment 7 Wim Jongman CLA 2011-11-08 12:22:07 EST
fix released to master:

http://git.eclipse.org/c/ecf/org.eclipse.ecf.git/commit/?id=24c42c92c9bc367225c1d1db4f85b553fca82b89

(In reply to comment #6)
> Hi Ahmed,
> 
> ECF 3.5.3 is going to be released soon (as soon as our builder security issues
> get worked out)...are there any pending changes associated with this bug that
> should delay the release (as it would be nice to get them in immediately)...or
> should the build/release proceed as things are?
> 
> Thanks,
> 
> Scott
Comment 8 Franky Bridelance CLA 2011-11-09 08:43:02 EST
Hi Ahmed,

Thanks for the very precise response. See inline.

Best regards,

Franky

(In reply to comment #5)
> Hi Franky,
> 
> >What I see is that sometimes during the
> >republish of the first remote service (in NodeWrite.publish) a KeeperException
> >is thrown with code SESSIONEXPIRED (by ZooKeeper.exists method). This results
> >that the first remote service is never republished (hence the critical
> >severity). Btw this was hard to find because the "catch (KeeperException e)" in
> >NodeWrite.publish method only logs the exception in case the KeeperException
> >code is CONNECTIONLOSS...
> 
> That method's semantics was meant so, and as you can see nothing (except from
> logging) is done in catch clauses, as it's not 
> the task of a NodeWriter instance to recover from connectionless and session
> timeouts cases by itself. Actually, it just cannot.
> That task is dealt with centrally in WriteRoot where, anyway, connection-loss
> and timeouts are detected and dealt with by attempting to republish all (at
> that time known) services, thus inclusive the ones that failed middle-way.
> 
> If you're getting timeouts, then it's more likely related to the way we create
> zookeeper clients. Currently we build a writing zooKeeper client with a
> sessionTimeout of 3000ms as in code:
> 
> this.writeKeeper = new ZooKeeper(this.ip, 3000 <--sessionTimeout , this);
> in WriteRoot class.
> 
> As a first attempt helping this bug away (which is a very valuable bug case due
> its reproduction rarity), you'd rather, please, try widening that session
> time-out in those places. If that solves the timing out problem, then we can
> externalize that sessionTimeout setting somehow to give users more control.
> 

The timing out problem in my case happened when system time of the computer had changed (by synch with NTP server). Having a bigger session timeout at zookeeper client creation would have helped but system time changes of more than 3 seconds  won't happen very often I guess and as long as zoodiscovery recovers well (i.e. all remote service republished), a rare session timeout is not a big deal.

> 
> >Event processing in WriteRoot is done synchronously which means that the
> >reconnect is done in the "dying" EventThread of zookeeper. I see that in
> >ReadRoot event processing is done asynchronously (+ synchronization block), so
> >I guess that in WriteRoot it should also be done asynchronously (+
> >synchronization block).
> 
> The whole WriteRoot object runs on separate thread. This class deals primarily
> with ZooKeeper lifecycle call backs in contrast to node monitoring done by
> readers which is asyn on call basis. Yet, if wrapping that method's body
> in a separte thread similar to its counterpart in ReadRoot helps, I'm OK with
> it.

I can't really explain why event processing in a separate thread helps, I don't have enough insight in how zookeeper is working. It's just that when event processing is done in zookeeper's "dying" event thread, the first remote service republish can get a  session expired exception just as if zookeeper didn't have the time to fully recover from the previous session timeout...

> 
> >- In "Disconnected" and "Expired" event processing, isConnected is first set to
> >false and then unpublishAll is called on watchManager which will call remove
> >method on every NodeWriter. In this NodeWriter remove method first thing
> >checked is if isConnected of WriteRoot is true to unpublish the remote service
> >but it's always false... So no unpublish happens
> 
> The unpublishing action has 2 phases: First, removing the service instances
> from the collection of services known to the local ZooDiscovery, then
> deleting each service's node representation from the zooKeeper server. Hence,
> the second is only relevant for explicit service unregistration calls (e.g.
> ZooDiscoveryContainer.unregisterService(),unregisterService(IServiceInfo))
> PROVIDED we're still connected, of course.
> In the other 2 cases, "Disconnected" and "Expired", the second phase is neither
> needed nor possible and the being-connected-to server will automatically 
> detect our fall away leading it to delete the services/nodes our client
> registered previously.

Ok, understood.

> 
> >- In WriteRoot.connect method there is following code: "this.writeKeeper =
> >null; this.watchManager.removeZooKeeper(this.writeKeeper);".
> >Because writeKeeper is set to null before the removeZooKeeper call the old
> >writeKeeper is never removed.
> 
> Correct. Should be similar to its counterpart in ReadRoot
> 
> >- In WriteRoot.connect method a new ZooKeeper instance is created and assigned
> >to writeKeeper member but it never gets added to the watchManager (like done in
> >initWriteKeeper). In fact it's best to remove it from the initWriteKeeper and
> >add it in the "SyncConnected" event processing (similar to ReadRoot).
> 
> Agree.
> 
> Thanks
Comment 9 Wim Jongman CLA 2011-11-10 02:18:18 EST
Franky can you please install the new ECF release and confirm that things are fixed.
Comment 10 Franky Bridelance CLA 2011-11-17 08:39:43 EST
(In reply to comment #9)
> Franky can you please install the new ECF release and confirm that things are
> fixed.

Hi Wim,

Works for me.

Br,

Franky
Comment 11 Wim Jongman CLA 2011-11-17 11:07:50 EST
Closing. Thanks Franky and Ahmed.