Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sys/storage/raft/bootstrap option to reset TLS keyring #18615

Merged
merged 8 commits into from
Oct 1, 2024

Conversation

Shaeli
Copy link
Contributor

@Shaeli Shaeli commented Jan 6, 2023

This PR resolves #14032

It adds a new reset_tls_keyring optional parameter for the sys/storage/raft/bootstrap endpoint.

This will help recovering a vault cluster using raft for HA management only and not storage.

Currently, if a cluster is using raft for its HA management but another storage system (mysql, S3 etc), it's not possible to easily recover vault data when all servers of the cluster are not functioning anymore.

When re-creating vault nodes and connecting them to the storage, no node can be elected active because there is no raft cluster assembled to conduct the vote. Using the bootstrap endpoint (sys/storage/raft/bootstrap) is failing with :

$ vault write -f sys/storage/raft/bootstrap
Error writing data to sys/storage/raft/bootstrap: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/sys/storage/raft/bootstrap
Code: 500. Errors:

* could not generate TLS keyring during bootstrap: TLS keyring already present

There is a workaround to recreate a vault cluster, which is:

  • Start vault in recovery mode
  • Generate a recovery token
  • Run VAULT_TOKEN=<recovery-token> vault delete sys/raw/core/raft/tls
  • Restart vault in non-recovery mode
  • Run vault write -f sys/storage/raft/bootstrap : the current node is now leader and no data was lost.

But this is not ideal as it's dangerously manipulating the storage. As the storage is persistent and not lost with the dead servers (storage is not RAFT), we should be able to re-bootstrap the cluster and recover from new servers.

The issue is also described here: #14032 and mentioned here: https://discuss.hashicorp.com/t/how-to-migrate-from-non-ha-backend-to-non-ha-backend-raft-ha-backend/13702/10

I'm not familiar with golang, so please let me know if anything can be improved :)

@hashicorp-cla
Copy link

hashicorp-cla commented Jan 6, 2023

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@drawks drawks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'm wondering if there is anything else that we can do to get a maintainer to review this.

@drawks
Copy link
Contributor

drawks commented Sep 19, 2023

@hsimon-hashicorp any chance you can help solicit a reviewer on this? I'm here again running a very similar disaster recovery simulation as in #14032 and noticing that this MR to fix/solve the issue has been sitting idle since it was submitted back in January. Any help would be appreciated.

Copy link
Contributor

@VioletHynes VioletHynes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! I'm very sorry this took such a long time to get eyes on. If you're still interested, I'm still happy to work towards getting this merged, as this would be a nice improvement to Vault.

I've outlined a few suggestions I'd like to see before I approve this. Most importantly, I think we can avoid the new option added to the bootstrap endpoint by checking to see if we're using raft storage instead. There's also some merge conflicts to resolve, but they seem simple enough.

Before getting this merged, we'd also need a changelog. This would be a file called 18615.txt in the changelog directory that should look something like this (I'll leave the wording to you, feel free to look at the other files in that directory for inspiration):
release-note:bug
core: fixed issue when attempting to bootstrap HA when using Raft as HA but not storage

I love the test addition, and the quality of the submitted code is very encouraging 👌. If you have any questions about my comments, feel free to ask!

http/sys_raft.go Outdated Show resolved Hide resolved
vault/raft.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Outdated Show resolved Hide resolved
@drawks
Copy link
Contributor

drawks commented Sep 25, 2024

if @Shaeli isn't available/interested in updating this PR I can probably find some time to adopt it and update as needed as long as we've got organizational engagement from hashicorp that makes it likely to be re-reviewed with a high likelihood of merging. It is a bad game of tag if we keep the current pace around this issue.

@Shaeli
Copy link
Contributor Author

Shaeli commented Sep 26, 2024

👋
@drawks I can update the PR with the suggestions, probably tomorrow or over the week end and 🤞 we can get this reviewed again and merged!

Thanks @VioletHynes for the review, I will update soon

@Shaeli
Copy link
Contributor Author

Shaeli commented Oct 1, 2024

@VioletHynes I updated based on your recommendations - I removed the parameter, updated everywhere accordingly and fixed the tests (back then, this was probably one of my first golang code, that's why I didn't know require / assert functions haha)

We might need to add another test to ensure it doesn't change current functionality with RAFT backend.

Thanks for the detailed review, let me know what you think of the updates

Copy link
Contributor

@VioletHynes VioletHynes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! As you mentioned, I think it'd be great if we had raft as a test here too. Could you have a quick look to see if that would be easy enough to add?

I have a small comment on the wording of the comment that I'd like to see changed, but that's a minor thing.

Without the raft test, I'd still like to merge this, as I think the chance of issue is very low, but I would appreciate if you checked if one was easy to add. Let me know!

After you've looked into that, I'd be happy to merge this. Thanks for bearing with us for so long, and I appreciate the PR updates despite the wait.

vault/raft.go Outdated Show resolved Hide resolved
vault/external_tests/raftha/raft_ha_test.go Show resolved Hide resolved
Copy link
Contributor

@VioletHynes VioletHynes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for the diligence here, as well as the patience in getting this reviewed and resolved.

I'm going to wait for all of the tests to run, but assuming no nasty surprises, I'll be merging this once we get back all green.

@Shaeli
Copy link
Contributor Author

Shaeli commented Oct 1, 2024

Looks great! Thanks for the diligence here, as well as the patience in getting this reviewed and resolved.

I'm going to wait for all of the tests to run, but assuming no nasty surprises, I'll be merging this once we get back all green.

Seems like updating the branch with almost 2y changes broke the test, I'm checking !

@VioletHynes
Copy link
Contributor

Right, yeah, looks like a few of the old references need to be updated, and you may need to find an alternative to TestRaftServerAddressProvider. I'm not sure why it was removed, but I don't think there'd be much harm in keeping whatever the body was as a local function for this test. There are some examples around the code of how to create these elsewhere, like NewHardcodedServerAddressProvider.

@Shaeli
Copy link
Contributor Author

Shaeli commented Oct 1, 2024

@VioletHynes This should be fixed 🤞

@VioletHynes
Copy link
Contributor

Thanks for the quick update! Running the tests again 🤞

@Shaeli Shaeli force-pushed the allow-reboostrap-raft-storage branch from 4ece7b9 to 59e7c10 Compare October 1, 2024 15:03
@Shaeli
Copy link
Contributor Author

Shaeli commented Oct 1, 2024

Do you know by chance if this test failure is something known / expected? 😬
I don't think its related to my change, but it could be a unexpected side effect

=== FAIL: vault Test_handleActivityWriteData/write_entities (0.01s)
    logical_system_activity_write_testonly_test.go:577: 
        	Error Trace:	/home/runner/work/vault/vault/vault/logical_system_activity_write_testonly_test.go:577
        	Error:      	"[]" should have 1 item(s), but has 0
        	Test:       	Test_handleActivityWriteData/write_entities

@VioletHynes
Copy link
Contributor

That seems to me like a known flaky and your code shouldn't touch this at all. I'm giving it a re-run. I'm pretty sure we're all good on your end, but I'll keep re-running until we're green just in case.

@VioletHynes VioletHynes merged commit b195342 into hashicorp:main Oct 1, 2024
66 of 67 checks passed
@VioletHynes
Copy link
Contributor

Thanks for your contribution, and again, apologies for how long it took to get eyes on this. This was a great PR and I appreciate your quick responses and high quality, tested code.

Merged :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sys/storage/raft/bootstrap should have an option to force boostrap
4 participants