-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sys/storage/raft/bootstrap option to reset TLS keyring #18615
Add sys/storage/raft/bootstrap option to reset TLS keyring #18615
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I'm wondering if there is anything else that we can do to get a maintainer to review this.
@hsimon-hashicorp any chance you can help solicit a reviewer on this? I'm here again running a very similar disaster recovery simulation as in #14032 and noticing that this MR to fix/solve the issue has been sitting idle since it was submitted back in January. Any help would be appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi there! I'm very sorry this took such a long time to get eyes on. If you're still interested, I'm still happy to work towards getting this merged, as this would be a nice improvement to Vault.
I've outlined a few suggestions I'd like to see before I approve this. Most importantly, I think we can avoid the new option added to the bootstrap
endpoint by checking to see if we're using raft storage instead. There's also some merge conflicts to resolve, but they seem simple enough.
Before getting this merged, we'd also need a changelog. This would be a file called 18615.txt
in the changelog
directory that should look something like this (I'll leave the wording to you, feel free to look at the other files in that directory for inspiration):
release-note:bug
core: fixed issue when attempting to bootstrap HA when using Raft as HA but not storage
I love the test addition, and the quality of the submitted code is very encouraging 👌. If you have any questions about my comments, feel free to ask!
if @Shaeli isn't available/interested in updating this PR I can probably find some time to adopt it and update as needed as long as we've got organizational engagement from hashicorp that makes it likely to be re-reviewed with a high likelihood of merging. It is a bad game of tag if we keep the current pace around this issue. |
👋 Thanks @VioletHynes for the review, I will update soon |
@VioletHynes I updated based on your recommendations - I removed the parameter, updated everywhere accordingly and fixed the tests (back then, this was probably one of my first golang code, that's why I didn't know We might need to add another test to ensure it doesn't change current functionality with RAFT backend. Thanks for the detailed review, let me know what you think of the updates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! As you mentioned, I think it'd be great if we had raft as a test here too. Could you have a quick look to see if that would be easy enough to add?
I have a small comment on the wording of the comment that I'd like to see changed, but that's a minor thing.
Without the raft test, I'd still like to merge this, as I think the chance of issue is very low, but I would appreciate if you checked if one was easy to add. Let me know!
After you've looked into that, I'd be happy to merge this. Thanks for bearing with us for so long, and I appreciate the PR updates despite the wait.
Co-authored-by: Violet Hynes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for the diligence here, as well as the patience in getting this reviewed and resolved.
I'm going to wait for all of the tests to run, but assuming no nasty surprises, I'll be merging this once we get back all green.
Seems like updating the branch with almost 2y changes broke the test, I'm checking ! |
Right, yeah, looks like a few of the old references need to be updated, and you may need to find an alternative to |
@VioletHynes This should be fixed 🤞 |
Thanks for the quick update! Running the tests again 🤞 |
4ece7b9
to
59e7c10
Compare
Do you know by chance if this test failure is something known / expected? 😬
|
That seems to me like a known flaky and your code shouldn't touch this at all. I'm giving it a re-run. I'm pretty sure we're all good on your end, but I'll keep re-running until we're green just in case. |
Thanks for your contribution, and again, apologies for how long it took to get eyes on this. This was a great PR and I appreciate your quick responses and high quality, tested code. Merged :) |
This PR resolves #14032
It adds a new
reset_tls_keyring
optional parameter for thesys/storage/raft/bootstrap
endpoint.This will help recovering a vault cluster using raft for HA management only and not storage.
Currently, if a cluster is using raft for its HA management but another storage system (mysql, S3 etc), it's not possible to easily recover vault data when all servers of the cluster are not functioning anymore.
When re-creating vault nodes and connecting them to the storage, no node can be elected active because there is no raft cluster assembled to conduct the vote. Using the bootstrap endpoint (
sys/storage/raft/bootstrap
) is failing with :There is a workaround to recreate a vault cluster, which is:
VAULT_TOKEN=<recovery-token> vault delete sys/raw/core/raft/tls
vault write -f sys/storage/raft/bootstrap
: the current node is now leader and no data was lost.But this is not ideal as it's dangerously manipulating the storage. As the storage is persistent and not lost with the dead servers (storage is not RAFT), we should be able to re-bootstrap the cluster and recover from new servers.
The issue is also described here: #14032 and mentioned here: https://discuss.hashicorp.com/t/how-to-migrate-from-non-ha-backend-to-non-ha-backend-raft-ha-backend/13702/10
I'm not familiar with golang, so please let me know if anything can be improved :)