| Summary: | Improve ZooKeeper connection robustness | ||
|---|---|---|---|
| Product: | z_Archived | Reporter: | Gunnar Wagenknecht <gunnar> |
| Component: | gyrex | Assignee: | Project Inbox <gyrex-platform-inbox> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | enhancement | ||
| Priority: | P3 | CC: | andreas.mihm |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Bug Depends on: | 360505, 360811 | ||
| Bug Blocks: | |||
Description
Gunnar Wagenknecht
Here is a sample log trace from a really severe connection loss on one of our production servers.
08:07:49,606 - ZooKeeper Gate is now DOWN. Connection to cloud lost.
08:07:49,632 - Jetty brings down all web apps
08:07:53,207 - Unable to set node to offline
08:08:05,595 - ZooKeeper Gate is now UP. Connection to cloud established.
08:08:05,605 - Node registration in cloud failed. Will retry.
(hint: ephemeral node still exists)
08:08:05,672 - context hierarchy flush begins (for all active contexts)
(happens on [ZooKeeper Gate Connect Thread-EventThread])
08:08:07,155 - many, many errors logged by ZooKeeperGate because of bogus
connection listener ZooKeeperBasedService due to exception
(IllegalStateException: Node '...' has been removed.)
(note: messages logged a second time by FrameworkLog)
08:08:12,645 - Node registration in cloud still failed. Will be retried.
08:08:18,796 - ZooKeeper Gate is now DOWN. Connection to cloud lost.
(connection lost a second time)
08:08:26,917 - ZooKeeper Gate is now UP. Connection to cloud established.
08:08:26,928 - Node registration in cloud failed. Will retry.
----
A few comments:
08:08:07,155 ... these may all be obsolete listeners/watches for preference nodes which have been removed; we likely don't properly remove them in ZK (well, ZK doesn't have a remove-watcher method). Likely sequence: a node is removed in ZK, the event is delivered to the preference node, the preference node sets an exists watch, the node is disconnected, the node reconnects, the connect event triggers a reload, and the reload throws the node-removed exception. The exception is likely wrong. Q: should we silently re-establish the #exists watch or discard it (the watched node is invalid and we could rely on the children-change event)?
(In reply to comment #1)
> 08:08:07,155 ... this may all be obsolete listeners/watches for preference
> nodes which have been removed; we likely don't properly remove them in ZK (well
> ZK doesn't have a remove watcher method); a node has been removed in ZK, event
> is triggered to preference node, preference node sets an exists hook, node is
> disconnected, node reconnects, connect event triggers reload, node removed
> exception triggers ... likely the exception is wrong ... Q: should we silently
> re-establish the #exists watch or discard it (the watched is invalid and we
> could rely on the children change event)

This was a quick win. I released a fix in order to properly close a service when a preference node is removed. This will remove the connection listener and the next time a ZK re-connect happens the listeners won't be called anymore.

Fixed in 1.1
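
For illustration only, here is a minimal sketch of the idea behind the fix: a service that closes itself when its node is removed also unregisters its connection listener, so a later reconnect no longer reaches it. This is not the actual Gyrex code. `Gate`, `NodeService`, `nodeRemoved()`, and the `/prefs/...` paths are hypothetical stand-ins, and no real ZooKeeper API is used.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for the gate that notifies listeners on every (re)connect. */
class Gate {
    interface ConnectionListener {
        void connected();
    }

    // Copy-on-write so listeners may unregister while a notification is in progress.
    private final List<ConnectionListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(ConnectionListener listener) { listeners.add(listener); }

    void removeListener(ConnectionListener listener) { listeners.remove(listener); }

    int listenerCount() { return listeners.size(); }

    /** Simulates a reconnect: every listener that is still registered is called. */
    void fireConnected() {
        for (ConnectionListener listener : listeners) {
            listener.connected();
        }
    }
}

/** Hypothetical service backing a single preference node. */
class NodeService implements Gate.ConnectionListener {
    private final Gate gate;
    private final String path;
    private boolean removed;
    private boolean closed;
    int reloads; // number of reloads triggered by connect events

    NodeService(Gate gate, String path) {
        this.gate = gate;
        this.path = path;
        gate.addListener(this);
    }

    @Override
    public void connected() {
        if (removed) {
            // The error seen in the log when a stale listener is still registered.
            throw new IllegalStateException("Node '" + path + "' has been removed.");
        }
        reloads++;
    }

    /** Called when the backing node is deleted: mark it removed and close the service. */
    void nodeRemoved() {
        removed = true;
        close();
    }

    /** Closing unregisters the connection listener; safe to call more than once. */
    void close() {
        if (!closed) {
            closed = true;
            gate.removeListener(this);
        }
    }
}

class ListenerCleanupDemo {
    public static void main(String[] args) {
        Gate gate = new Gate();
        NodeService kept = new NodeService(gate, "/prefs/kept");
        NodeService gone = new NodeService(gate, "/prefs/gone");

        gone.nodeRemoved();   // closes the service and drops its listener
        gate.fireConnected(); // only 'kept' is notified; no IllegalStateException

        System.out.println(gate.listenerCount() + " listener(s), kept reloads=" + kept.reloads);
    }
}
```

If `nodeRemoved()` did not call `close()`, the stale listener would still be invoked on the next reconnect and would throw the `IllegalStateException` shown in the trace.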