Watching the server lock differently #5132
Comments
Would be good if the tserver did a check of its lock after experiencing an error writing to its walog. Currently the following can happen.
Would be nice to have the tserver check its lock between steps 2 and 3 above. That could be done by calling the same code that is used for the periodic check.
An alternative to a dedicated thread that periodically checks is to have a dedicated ZooKeeper object for lock acquisition. Not sure if this is better or worse overall. It would create an extra ZooKeeper connection per server, but it would avoid the constant pings to ZooKeeper, allowing Accumulo to rely on ZooKeeper's notification mechanisms. This periodic check may not be needed on all server types. It's most important on tablet servers and the manager. Compactors and scan servers could forgo this check to avoid introducing extra load on ZooKeeper.
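To make the trade-off above concrete, here is a minimal sketch (hypothetical names, not Accumulo code) of how the decision about which server types run the periodic lock check could be expressed: tablet servers and the manager verify, compactors and scan servers skip it to spare ZooKeeper.

```java
// Hypothetical sketch: per-server-type policy for the periodic lock check.
// Tablet servers and the manager need it most; compactors and scan servers
// could skip it to avoid extra ZooKeeper load, per the discussion above.
public class LockCheckPolicy {
  public enum ServerType { MANAGER, TABLET_SERVER, COMPACTOR, SCAN_SERVER }

  /** Returns true if this server type should run the periodic lock check. */
  public static boolean shouldVerifyLock(ServerType type) {
    switch (type) {
      case MANAGER:
      case TABLET_SERVER:
        return true;
      default:
        return false;
    }
  }
}
```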
This is an example of code that does a lot of work when processing a ZooKeeper event: it reads from ZooKeeper inside the event callback. This type of activity could tie up the single event thread and prevent processing of other events, like the one for a lost lock. Posting this as an example of why this change could be useful: accumulo/server/base/src/main/java/org/apache/accumulo/server/manager/LiveTServerSet.java, line 307 in c6ec969
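The hazard described above can be illustrated without ZooKeeper at all (this is a toy demonstration, not Accumulo code): a single-thread executor stands in for the client's one event thread, so a slow callback delays every event queued behind it, including a lock-lost notification.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy illustration of the single-event-thread hazard: ZooKeeper delivers all
// watch events on one thread, so a slow Watcher callback (e.g. one that reads
// from ZooKeeper) delays the lock-lost event queued behind it.
public class SlowWatcherDemo {
  /** Returns how long (ms) a queued "lock lost" event waited behind a slow callback. */
  public static long lockLostDelayMillis(long slowCallbackMillis) throws Exception {
    ExecutorService eventThread = Executors.newSingleThreadExecutor();
    long submitted = System.nanoTime();
    // First event: a watcher callback doing expensive work.
    eventThread.submit(() -> {
      try { Thread.sleep(slowCallbackMillis); } catch (InterruptedException ignored) {}
    });
    // Second event: the lock-lost notification, stuck behind the slow one.
    Future<Long> lockLost = eventThread.submit(System::nanoTime);
    long delay = (lockLost.get() - submitted) / 1_000_000;
    eventThread.shutdown();
    return delay;
  }
}
```

A second ZooKeeper client dedicated to the lock sidesteps exactly this queueing.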
I think that we are already creating more ZK client objects than we need anyway, which is something I'm addressing as part of #5124, so it may not be so bad to have a separate one just for this purpose.
…ead (#5145) Moved getLock() from TabletHostingServer to AbstractServer. Added method to ServiceLock to verify the lock is being held in ZooKeeper versus relying on the Watcher. Added method to start verification thread in AbstractServer, which is called from the Manager and TabletServer. Closes #5132 Co-authored-by: Keith Turner <[email protected]>
The half-dead TabletServer problem occurs when the TabletServer's ZooKeeper connection has timed out and the ephemeral lock in the ZooKeeper server has been removed, but the TabletServer's ZooKeeper Watcher has not been notified. When the lock watcher is notified, the TabletServer halts itself. When the lock watcher is not notified, the TabletServer may continue to try to write mutations for tablets that may be hosted somewhere else because the Manager has noticed that the ZooKeeper lock is missing and re-assigned the tablets.
Looking in the ZooKeeper client code, Watcher notifications occur in `ClientCnxn.EventThread.run`, specifically in the call to `processEvent`. There is one `EventThread` created in the `ClientCnxn` object, so if a Watcher callback method takes a long time or gets stuck, the lock loss notification may not occur in a timely fashion.

One solution might be to create a new thread in the server processes that uses a different ZooKeeper client object specifically to watch the server lock. This thread could call `ZooKeeper.exists(path, false)` periodically to check that the lock node exists without using a Watcher. The thread could run at a higher priority so that it has a better chance of getting cpu time under load, but use `Thread.sleep` between iterations so that it yields to other threads for some amount of time.
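The loop described above could be sketched as follows. This is an illustrative outline, not the code merged in #5145: the ZooKeeper existence check is abstracted as a `BooleanSupplier` so the loop is self-contained; in a real server it would wrap `zk.exists(path, false) != null` on a ZooKeeper client reserved for this purpose, and the lock-lost action would halt the process.

```java
import java.util.function.BooleanSupplier;

// Sketch (hypothetical names): a dedicated thread that periodically confirms
// the server's lock node still exists, rather than relying solely on Watcher
// notifications delivered by the shared event thread.
public class LockVerifier implements Runnable {
  private final BooleanSupplier lockExists; // e.g. () -> zk.exists(path, false) != null
  private final Runnable onLockLost;        // e.g. halt the server process
  private final long intervalMillis;
  private volatile boolean stopped = false;

  public LockVerifier(BooleanSupplier lockExists, Runnable onLockLost, long intervalMillis) {
    this.lockExists = lockExists;
    this.onLockLost = onLockLost;
    this.intervalMillis = intervalMillis;
  }

  public void stop() { stopped = true; }

  @Override
  public void run() {
    while (!stopped) {
      if (!lockExists.getAsBoolean()) {
        onLockLost.run(); // lock node is gone: react immediately
        return;
      }
      try {
        Thread.sleep(intervalMillis); // yield to other threads between checks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  public static Thread start(BooleanSupplier lockExists, Runnable onLockLost, long intervalMillis) {
    Thread t = new Thread(new LockVerifier(lockExists, onLockLost, intervalMillis), "lock-verifier");
    t.setDaemon(true);
    t.setPriority(Thread.MAX_PRIORITY); // better chance of cpu time under load
    t.start();
    return t;
  }
}
```

Running the checker on its own ZooKeeper client means a stuck Watcher callback on the shared `EventThread` cannot delay detection of the lost lock.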