Bug 358210 - Improve ZooKeeper connection robustness
Summary: Improve ZooKeeper connection robustness
Status: RESOLVED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: gyrex
Version: unspecified
Hardware: All All
Importance: P3 enhancement
Target Milestone: ---
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 360505 360811
Blocks:
Reported: 2011-09-20 05:52 EDT by Gunnar Wagenknecht CLA
Modified: 2018-03-19 11:59 EDT
CC List: 1 user

See Also:


Description Gunnar Wagenknecht CLA 2011-09-20 05:52:21 EDT
Gyrex should provide better support for maintenance work on a ZooKeeper ensemble.

Things to consider:

Fail over the ZooKeeper session on a single Gyrex node (or a group of nodes) to a specific ZooKeeper connection (e.g., a reduced set of ZooKeeper clients).

Console command to "flush" connection.

Ensure graceful handling of connection loss and recovery.
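
To make the last point more concrete, here is a minimal sketch of a connection gate that tolerates repeated up/down events and misbehaving listeners. All names (`ConnectionGateSketch`, `GateListenerSketch`) are hypothetical and are not the actual Gyrex or ZooKeeper API; it only illustrates one possible approach.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical listener interface for this sketch; not the actual Gyrex API. */
interface GateListenerSketch {
    void gateUp();
    void gateDown();
}

/** Hypothetical gate that tracks the connection state and notifies listeners. */
class ConnectionGateSketch {
    private final List<GateListenerSketch> listeners = new CopyOnWriteArrayList<>();
    private final List<String> errors = new CopyOnWriteArrayList<>();
    private boolean up;

    void addListener(GateListenerSketch listener) {
        listeners.add(listener);
    }

    void removeListener(GateListenerSketch listener) {
        listeners.remove(listener);
    }

    /** Errors collected from failing listeners (a real system would log them once). */
    List<String> errors() {
        return errors;
    }

    /** Called when the connection is (re-)established; repeated calls are ignored. */
    void connectionEstablished() {
        if (changeState(true)) {
            notifyListeners(true);
        }
    }

    /** Called when the connection is lost; repeated calls are ignored. */
    void connectionLost() {
        if (changeState(false)) {
            notifyListeners(false);
        }
    }

    /** Returns true only if the state actually changed. */
    private synchronized boolean changeState(boolean newUp) {
        if (up == newUp) {
            return false;
        }
        up = newUp;
        return true;
    }

    private void notifyListeners(boolean nowUp) {
        for (GateListenerSketch listener : listeners) {
            try {
                if (nowUp) {
                    listener.gateUp();
                } else {
                    listener.gateDown();
                }
            } catch (RuntimeException e) {
                // One failing listener must not keep the others from being notified.
                errors.add(String.valueOf(e.getMessage()));
            }
        }
    }
}
```

The state change is decided under a lock, but listeners are called outside of it, so a slow listener does not block the thread that reports the next connection event.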
Comment 1 Gunnar Wagenknecht CLA 2011-10-13 09:47:05 EDT
Here is a sample log trace from a really severe connection loss of one of our production servers.


08:07:49,606 - ZooKeeper Gate is now DOWN. Connection to cloud lost.
08:07:49,632 - Jetty brings down all web apps
08:07:53,207 - Unable to set node to offline
08:08:05,595 - ZooKeeper Gate is now UP. Connection to cloud established.
08:08:05,605 - Node registration in cloud failed. Will retry. 
               (hint: ephemeral node still exists)
08:08:05,672 - context hierarchy flush begins (for all active contexts)
               (happens on [ZooKeeper Gate Connect Thread-EventThread])
08:08:07,155 - many, many errors logged by ZooKeeperGate because of bogus 
               connection listener  ZooKeeperBasedService due to exception
               (IllegalStateException: Node '...' has been removed.)
               (note: messages logged a second time by FrameworkLog)
08:08:12,645 - Node registration in cloud still failed. Will be retried.
08:08:18,796 - ZooKeeper Gate is now DOWN. Connection to cloud lost.
               (connection lost a second time)
08:08:26,917 - ZooKeeper Gate is now UP. Connection to cloud established.
08:08:26,928 - Node registration in cloud failed. Will retry.

----

A few comments:

08:08:07,155 ... These may all be obsolete listeners/watches for preference nodes which have been removed; we likely don't properly remove them in ZK (well, ZK doesn't have a remove-watcher method). The sequence: a node is removed in ZK, the event is delivered to the preference node, the preference node sets an exists hook, the node is disconnected, the node reconnects, the connect event triggers a reload, and the reload triggers the node-removed exception. Likely the exception is wrong. Q: Should we silently re-establish the #exists watch or discard it? (The watch is invalid, and we could rely on the children change event.)
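
The "discard it" option from the question above could look roughly like the following sketch. It uses a hypothetical in-memory store (`OneShotWatchStoreSketch`) instead of the real ZooKeeper client so that it runs on its own; the only ZooKeeper behaviour it assumes is that a standard watch fires once and must then be registered again. `WatchedNodeSketch` is likewise a made-up stand-in for a preference node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a store with one-shot watches. */
class OneShotWatchStoreSketch {
    private final Set<String> nodes = new HashSet<>();
    private final Map<String, List<Runnable>> watches = new HashMap<>();

    /** Reports whether the node exists; a non-null watch is registered to fire once. */
    boolean exists(String path, Runnable watch) {
        if (watch != null) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watch);
        }
        return nodes.contains(path);
    }

    void create(String path) {
        if (nodes.add(path)) {
            fire(path);
        }
    }

    void delete(String path) {
        if (nodes.remove(path)) {
            fire(path);
        }
    }

    int watchCount(String path) {
        List<Runnable> list = watches.get(path);
        return list == null ? 0 : list.size();
    }

    /** Removes the watches before running them, which makes them one-shot. */
    private void fire(String path) {
        List<Runnable> fired = watches.remove(path);
        if (fired != null) {
            fired.forEach(Runnable::run);
        }
    }
}

/** Hypothetical preference node that discards its watch once the node is gone. */
class WatchedNodeSketch {
    private final OneShotWatchStoreSketch store;
    private final String path;
    private boolean removed;
    private int reloads;

    WatchedNodeSketch(OneShotWatchStoreSketch store, String path) {
        this.store = store;
        this.path = path;
        store.exists(path, this::onChange);
    }

    boolean isRemoved() {
        return removed;
    }

    int reloads() {
        return reloads;
    }

    /** Called on reconnect; a removed node is skipped instead of throwing. */
    void reload() {
        if (removed) {
            return;
        }
        reloads++;
    }

    private void onChange() {
        if (store.exists(path, null)) {
            // The fired watch is consumed, so register a new one.
            store.exists(path, this::onChange);
        } else {
            // Node is gone: discard the watch and remember the removal.
            removed = true;
        }
    }
}
```

With this choice a removed node leaves no watch behind, and a later reconnect-triggered `reload()` becomes a no-op rather than raising an exception.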
Comment 2 Gunnar Wagenknecht CLA 2011-10-13 11:58:11 EDT
(In reply to comment #1)
> 08:08:07,155 ... this may all be obsolete listeners/watches for preference
> nodes which have been removed; we likely don't properly remove them in ZK (well
> ZK doesn't have a remove watcher method); a node has been removed in ZK, event
> is triggered to preference node, preference node sets an exists hook, node is
> disconnected, node reconnects, connect event triggers reload, node removed
> exception triggers ... likely the exception is wrong ... Q: should we silently
> re-establish the #exists watch or discard it (the watch is invalid and we
> could rely on the children change event)

This was a quick win. I released a fix that properly closes a service when its preference node is removed. Closing the service removes its connection listener, so the next time a ZK reconnect happens the listener is no longer called.
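
The idea behind the fix can be sketched as follows. This is not the released code; `ReconnectRegistrySketch` and `NodeServiceSketch` are hypothetical names that only illustrate how closing a service can unregister its listener so that later reconnects skip it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry of reconnect listeners; stands in for the gate. */
class ReconnectRegistrySketch {
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    void add(Runnable listener) {
        listeners.add(listener);
    }

    void remove(Runnable listener) {
        listeners.remove(listener);
    }

    int size() {
        return listeners.size();
    }

    /** Simulates a reconnect by calling every registered listener. */
    void reconnected() {
        listeners.forEach(Runnable::run);
    }
}

/** Hypothetical service bound to a single preference node. */
class NodeServiceSketch implements AutoCloseable {
    private final ReconnectRegistrySketch registry;
    // Keep one listener instance so that remove() finds the same object again.
    private final Runnable listener = this::reload;
    private boolean closed;
    private int reloads;

    NodeServiceSketch(ReconnectRegistrySketch registry) {
        this.registry = registry;
        registry.add(listener);
    }

    /** Called when the backing preference node has been removed. */
    void nodeRemoved() {
        close();
    }

    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        registry.remove(listener);
    }

    int reloads() {
        return reloads;
    }

    private void reload() {
        if (closed) {
            throw new IllegalStateException("Node has been removed.");
        }
        reloads++;
    }
}
```

Because the listener is unregistered in `close()`, the `IllegalStateException` path in `reload()` is no longer reached on reconnect once the node has been removed.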
Comment 3 Gunnar Wagenknecht CLA 2012-07-11 05:00:06 EDT
Fixed in 1.1