
Watching the server lock differently #5132

Closed
dlmarion opened this issue Dec 3, 2024 · 4 comments · Fixed by #5145
Comments

@dlmarion
Contributor

dlmarion commented Dec 3, 2024

The half-dead TabletServer problem occurs when the TabletServer's ZooKeeper connection has timed out and the ephemeral lock node in ZooKeeper has been removed, but the TabletServer's ZooKeeper Watcher has not been notified. When the lock watcher is notified, the TabletServer halts itself. When it is not notified, the TabletServer may keep trying to write mutations for tablets that may now be hosted elsewhere, because the Manager has noticed that the ZooKeeper lock is missing and has re-assigned the tablets.

Looking at the ZooKeeper client code, Watcher notifications occur in ClientCnxn.EventThread.run, specifically in the call to processEvent. There is a single EventThread per ClientCnxn object, so if a Watcher callback takes a long time or is stuck, the lock-loss notification may not occur in a timely fashion.

One solution might be to create a new Thread in the server processes that uses a separate ZooKeeper client object specifically to watch the server lock. This Thread could call ZooKeeper.exists(path, false) periodically to check that the lock node exists without using a Watcher. The thread could run at a higher priority so that it has a better chance of getting CPU time under load, but use Thread.sleep between iterations so that it yields to other threads for some amount of time.
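
A minimal sketch of that approach, assuming hypothetical names (LockVerificationThread, the lock path handling, and the halt behavior are illustrative, not the actual Accumulo code):

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of the periodic check described above; all names here
// are illustrative, not the actual Accumulo implementation.
public class LockVerificationThread extends Thread {

  private final ZooKeeper zk;        // dedicated client, separate from the main connection
  private final String lockPath;     // path of this server's ephemeral lock node
  private final long intervalMillis; // time to sleep between checks

  public LockVerificationThread(ZooKeeper zk, String lockPath, long intervalMillis) {
    this.zk = zk;
    this.lockPath = lockPath;
    this.intervalMillis = intervalMillis;
    setName("server-lock-verification");
    setDaemon(true);
    setPriority(Thread.NORM_PRIORITY + 2); // better chance of cpu time under load
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        // exists() with watch=false is a plain synchronous check, no Watcher involved
        if (zk.exists(lockPath, false) == null) {
          System.err.println("Server lock " + lockPath + " is missing, halting");
          Runtime.getRuntime().halt(-1); // a real implementation would likely use Accumulo's halt/logging utilities
        }
        Thread.sleep(intervalMillis); // yield to other threads between iterations
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      } catch (KeeperException e) {
        // a transient ZooKeeper error is not proof the lock is gone; retry next iteration
      }
    }
  }
}
```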

@keith-turner
Contributor

Would be good if the tserver did a check of its lock after experiencing an error writing to its walog. Currently the following can happen.

  1. Manager revokes the lease on a tserver's walog when it determines the tserver is dead.
  2. If the tserver is still alive, this causes a write error the next time it writes to its walog.
  3. When the tserver has a write error, it just creates a new walog.

Would be nice to have the tserver check its lock between steps 2 and 3 above. That could be done by calling the same code that is used for the periodic check.
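
A rough sketch of that ordering, using placeholder names (Walog, verifyLockHeld, and createNewWalog are stand-ins, not actual Accumulo classes or the code later added in #5145):

```java
import java.io.IOException;

// Placeholder sketch: Walog, verifyLockHeld(), and createNewWalog() are
// illustrative names, not actual Accumulo classes or methods.
class WalogWriter {

  interface Walog {
    void append(byte[] mutation) throws IOException;
  }

  private Walog currentWalog;

  void write(byte[] mutation) throws IOException {
    try {
      currentWalog.append(mutation);
    } catch (IOException e) {
      // The write error may mean the Manager revoked the walog lease because it
      // believes this tserver is dead; re-check the lock before recovering.
      if (!verifyLockHeld()) {
        System.err.println("Server lock lost, halting");
        Runtime.getRuntime().halt(-1); // do not keep accepting writes without the lock
      }
      // lock is still held, so treat this as an ordinary walog failure
      currentWalog = createNewWalog();
      currentWalog.append(mutation);
    }
  }

  private boolean verifyLockHeld() {
    return true; // would call the same code as the periodic check
  }

  private Walog createNewWalog() {
    return mutation -> {}; // would open a replacement walog
  }
}
```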

@keith-turner
Contributor

keith-turner commented Dec 4, 2024

An alternative to a dedicated thread that periodically checks is to have a dedicated zookeeper object for lock acquisition. Not sure if this is better or worse overall. It would create an extra zookeeper connection per server, but would avoid the constant pings to zookeeper allowing Accumulo to rely on the zookeeper notification mechanisms.

This periodic check may not be needed on all server types. It's most important on tablet servers and the manager. Compactors and scan servers could forgo this check in order to avoid introducing extra load on zookeeper.
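
For comparison, the dedicated-client alternative would look roughly like this (the connect string, timeout, and watcher bodies are placeholders, not the actual Accumulo wiring):

```java
import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: the point is that the server lock gets its own client,
// and therefore its own event thread, isolated from every other watcher the
// server registers on its main connection.
class LockClients {

  static ZooKeeper openMainClient(String connectString, int sessionTimeoutMs) throws IOException {
    // used for everything else; its single event thread may be busy with application watchers
    return new ZooKeeper(connectString, sessionTimeoutMs, event -> {});
  }

  static ZooKeeper openLockClient(String connectString, int sessionTimeoutMs) throws IOException {
    // used only to acquire and watch the server lock, so the lock-loss
    // notification is never queued behind slow application callbacks
    return new ZooKeeper(connectString, sessionTimeoutMs,
        event -> System.out.println("lock client event: " + event));
  }
}
```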

@keith-turner
Contributor

This is an example of code that does a lot of work when processing a zookeeper event: it reads from zookeeper while handling the event. That type of activity could tie up the single event thread and prevent processing of other events, like the one for a lost lock. Posting this as an example of why this change could be useful.
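
Not the code referenced above, but a generic illustration of the pattern, with made-up names and paths:

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Generic illustration of a Watcher that does synchronous ZooKeeper reads
// inside process(). ClientCnxn delivers all watcher callbacks on one
// EventThread, so anything queued behind this callback, including a
// lock-loss notification, waits until it finishes.
class BusyWatcher implements Watcher {

  private final ZooKeeper zk;
  private final String parentPath;

  BusyWatcher(ZooKeeper zk, String parentPath) {
    this.zk = zk;
    this.parentPath = parentPath;
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      // each of these calls blocks on a round trip to ZooKeeper while other
      // pending events sit in the event thread's queue
      List<String> children = zk.getChildren(parentPath, this);
      for (String child : children) {
        zk.getData(parentPath + "/" + child, false, null);
      }
    } catch (KeeperException | InterruptedException e) {
      // ignored for brevity in this sketch
    }
  }
}
```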

@ctubbsii
Member

ctubbsii commented Dec 4, 2024

> It would create an extra zookeeper connection per server, but would avoid the constant pings to zookeeper allowing Accumulo to rely on the zookeeper notification mechanisms.

I think that we are already creating more ZK client objects than we need to anyway, which is something I'm addressing as part of #5124, so it may not be so bad to have the separate one just for this purpose.

@dlmarion dlmarion self-assigned this Dec 5, 2024
dlmarion added a commit to dlmarion/accumulo that referenced this issue Dec 5, 2024
Moved getLock() from TabletHostingServer to AbstractServer. Added method
to ServiceLock to verify the lock is being held in ZooKeeper versus
relying on the Watcher. Added method to start verification thread in
AbstractServer, which is called from the Manager and TabletServer.

Closes apache#5132
dlmarion added a commit that referenced this issue Dec 7, 2024
…ead (#5145)

Moved getLock() from TabletHostingServer to AbstractServer. Added method
to ServiceLock to verify the lock is being held in ZooKeeper versus
relying on the Watcher. Added method to start verification thread in
AbstractServer, which is called from the Manager and TabletServer.

Closes #5132


Co-authored-by: Keith Turner <[email protected]>
@dlmarion dlmarion added this to the 2.1.4 milestone Dec 7, 2024
@dlmarion dlmarion closed this as completed Dec 7, 2024