Make tablet servers respond more aggressively to failed minor compaction. #5137

Open
keith-turner opened this issue Dec 4, 2024 · 2 comments · May be fixed by #5169

Comments

@keith-turner
Contributor

Is your feature request related to a problem? Please describe.

Sometimes when a minor compaction fails with an exception, it retries. When it does not retry because it is not known whether retrying is safe, it may be best to take some action to prevent future interaction with the tablet. Currently the only action taken is that the exception is logged, and the tablet is left in a half-functional state where reads and writes could still attempt to run against it. No minor compaction will ever run on the tablet again.

Describe the solution you'd like

There are a few possible actions that could be taken:

  • Halt the tablet server.
  • Make all operations against the tablet fail, so that all read and write RPCs fail after the minor compaction has failed.
  • Try to increase the ability to retry, although it will not always be possible to retry. If an NPE, ArrayIndexOutOfBoundsException, etc. happens during a minor compaction, then the state of the tablet in the tablet server is unknown at that point and retrying may not lead to a correct result.
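
To make the three options concrete, here is a minimal Java sketch of how a failure handler might choose between them. Every name in it (`MinorCompactionFailureSketch`, `TabletStub`, `Action`, `classify`, `handleFailure`) is hypothetical and not an Accumulo API. Treating `IOException` as the retryable case is an assumption made only for illustration; the one point taken from the description above is that exceptions such as an NPE leave the tablet state unknown.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names are Accumulo APIs. */
public class MinorCompactionFailureSketch {

  /** The three possible responses listed in this issue. */
  public enum Action { RETRY, FAIL_TABLET, HALT_SERVER }

  /** Stand-in for a hosted tablet; tracks whether it still accepts RPCs. */
  public static class TabletStub {
    private final AtomicBoolean failed = new AtomicBoolean(false);

    void markFailed() {
      failed.set(true);
    }

    /** Would be called at the start of every read or write RPC. */
    public void checkUsable() {
      if (failed.get()) {
        throw new IllegalStateException("tablet disabled after failed minor compaction");
      }
    }
  }

  /**
   * Assumed policy: I/O errors are considered safe to retry. Anything else
   * (NPE, ArrayIndexOutOfBoundsException, ...) leaves the in-memory state of
   * the tablet unknown, so it is never retried.
   */
  public static Action classify(Throwable t, boolean haltOnUnknownState) {
    if (t instanceof IOException) {
      return Action.RETRY;
    }
    return haltOnUnknownState ? Action.HALT_SERVER : Action.FAIL_TABLET;
  }

  /** Disables the tablet for any non-retryable failure and reports the chosen action. */
  public static Action handleFailure(TabletStub tablet, Throwable t, boolean haltOnUnknownState) {
    Action action = classify(t, haltOnUnknownState);
    if (action != Action.RETRY) {
      // Block further reads and writes; the caller decides whether to halt as well.
      tablet.markFailed();
    }
    return action;
  }
}
```

In this sketch the handler only reports `HALT_SERVER`; the caller would be responsible for actually stopping the process.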
@keith-turner keith-turner added the enhancement This issue describes a new feature, improvement, or optimization. label Dec 4, 2024
@dlmarion
Contributor

dlmarion commented Dec 6, 2024

If #5145 is merged, then this could call ServiceLock.verifyLockAtSource() and halt the VM if it returns false. If it returns true, maybe retry.
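
A rough sketch of that flow, under stated assumptions: the `BooleanSupplier` below stands in for `ServiceLock.verifyLockAtSource()` from #5145, whose actual signature is not shown in this thread, and the halt action is injected as a `Runnable` so the example can run without killing the JVM. The class and method names are hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of "verify the lock, then halt or retry". */
public class LockCheckSketch {

  /**
   * Returns true if the minor compaction may be retried.
   *
   * @param lockStillHeld stand-in for ServiceLock.verifyLockAtSource() (assumed shape)
   * @param halt action to run when the lock is gone; a real server might call
   *             Runtime.getRuntime().halt(status) here
   */
  public static boolean shouldRetry(BooleanSupplier lockStillHeld, Runnable halt) {
    if (!lockStillHeld.getAsBoolean()) {
      // The server no longer holds its lock, so stop instead of retrying.
      halt.run();
      return false;
    }
    // The lock is still held; a retry may be reasonable.
    return true;
  }
}
```

Injecting the halt action is only a testing convenience in this sketch and says nothing about how the eventual change is structured.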

@keith-turner
Contributor Author

> If #5145 is merged, then this could call ServiceLock.verifyLockAtSource() and halt the VM if it returns false. If it returns true, maybe retry.

That sounds good. I also opened #5146 about checking after walog failures.

@dlmarion dlmarion self-assigned this Dec 12, 2024
dlmarion added a commit to dlmarion/accumulo that referenced this issue Dec 12, 2024
@dlmarion dlmarion added this to the 2.1.4 milestone Dec 12, 2024