
Fix AP upgrade version issue #27277

Merged: 10 commits merged into main from fix/autopilot-upgrade-ii on Jun 5, 2024
Conversation

@banks (Member) commented May 30, 2024

This is a subtle issue introduced by #26449 while fixing a different issue. It only impacts Vault Enterprise, but it is generally simpler to reason about.

That PR was a correct fix; however, the effectiveSDKVersion was only ever being set on active nodes, not on perf standbys in Enterprise.

The big changes here are:

  • Get rid of the custom Set method and instead always set the effectiveSDKVersion from Core during SetupCluster, which runs on all types of servers.
  • As a belt-and-braces measure, default RaftBackend.effectiveSDKVersion to the binary version, so that even if we fail to plumb this from Core in some future edge case it's still the right value in almost all cases (other than testing environments). (A standalone sketch of both changes follows this list.)
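
As a rough illustration only (this is not the actual Vault diff; the type and field names mirror the ones discussed in this PR, but the bodies are simplified stand-ins):

package main

import "fmt"

// binaryVersion stands in for version.GetVersion().Version.
const binaryVersion = "1.17.0"

// RaftBackend models only the slice of the real backend relevant here.
type RaftBackend struct {
	effectiveSDKVersion string
}

// SetupOpts models the options passed at cluster setup time; the real
// struct carries many more fields (TLSKeyring, ClusterListener, ...).
type SetupOpts struct {
	EffectiveSDKVersion string
}

// NewRaftBackend defaults to the binary's own version so the field is
// never empty even if later plumbing is missed in some edge case.
func NewRaftBackend() *RaftBackend {
	return &RaftBackend{effectiveSDKVersion: binaryVersion}
}

// SetupCluster runs on every node type (active and perf standby), so
// overriding here closes the gap the old custom Set method left open.
func (b *RaftBackend) SetupCluster(opts SetupOpts) {
	if opts.EffectiveSDKVersion != "" {
		b.effectiveSDKVersion = opts.EffectiveSDKVersion
	}
}

func main() {
	b := NewRaftBackend()
	b.SetupCluster(SetupOpts{EffectiveSDKVersion: "1.17.0+ent"})
	fmt.Println(b.effectiveSDKVersion) // "1.17.0+ent"
}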

I've manually tested this locally. To reproduce the issue it fixes you need to:

  1. Attempt an AP upgrade in Vault Enterprise from any version before 1.15.8 to any version after 1.15.8
  2. Ensure you don't manually set autopilot_upgrade_version in config on the new servers (doing so bypasses the bug).

In this case you'll see the new servers join but never be recognized as running the "target version", because their Echos send an empty version string. You also see some logs about empty upgrade_versions, which are legitimate in this case but also confusing, because you can see them even after the fix due to timing (if AP runs after a node has joined raft on the leader but before its Echo has come in, you'll see that log even on a non-broken cluster, and you don't see a corresponding "OK" log afterwards).

TODO

  • also fix the logging confusion where we complain about no upgrade version due to timing and then don't log that it's OK again
  • review this scenario with others and test more cases to be sure this is correct.
  • See if there is a meaningful way we could test this in CI without a flaky complicated end-to-end test just for plumbing...

@banks banks requested a review from raskchanky May 30, 2024 12:49
@github-actions github-actions bot added the hashicorp-contributed-pr label May 30, 2024

github-actions bot commented May 30, 2024

CI Results:
All Go tests succeeded! ✅

@@ -582,6 +583,7 @@ func NewRaftBackend(conf map[string]string, logger log.Logger) (physical.Backend
failGetInTxn: new(uint32),
raftLogVerifierEnabled: backendConfig.RaftLogVerifierEnabled,
raftLogVerificationInterval: backendConfig.RaftLogVerificationInterval,
effectiveSDKVersion: version.GetVersion().Version,
banks (Member Author):

Setting it by default to the binary version is a defensive move so we have something that's 99% of the time going to be correct anyway even if the plumbing of the effective version is still incorrect in some edge case (or becomes incorrect later).

raskchanky (Contributor):

Good call

// EffectiveSDKVersion is typically the version string baked into the binary.
// We pass it in though because it can be overridden in tests or via ENV in
// core.
EffectiveSDKVersion string
banks (Member Author):

I preferred making this an option when we set up the cluster, since that is an existing mechanism for runtime config that isn't known at NewRaftBackend time, rather than a custom Set* method that needs correct plumbing.

TLSKeyring: raftTLS,
ClusterListener: c.getClusterListener(),
StartAsLeader: creating,
EffectiveSDKVersion: c.effectiveSDKVersion,
banks (Member Author):

Note that c.effectiveSDKVersion is only ever set in the CreateCore method, before the Core is returned, so if it's going to be set it will have been by now and we don't need any locking to read it (we read it without locks in other places too). That means this will always be the correct value by the time this call is made: configured by coreConfig or defaulted in CreateCore.
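
For illustration, a minimal standalone sketch of the set-once-before-publish pattern described above (the types and values here are simplified stand-ins, not the real Core):

package main

import "fmt"

type CoreConfig struct {
	EffectiveSDKVersion string
}

type Core struct {
	// Written exactly once inside CreateCore, before the *Core is
	// shared with any other goroutine, so later reads need no lock.
	effectiveSDKVersion string
}

func CreateCore(conf *CoreConfig) *Core {
	v := conf.EffectiveSDKVersion
	if v == "" {
		v = "1.17.0" // stands in for the binary's version
	}
	return &Core{effectiveSDKVersion: v}
}

func main() {
	c := CreateCore(&CoreConfig{})
	// By the time cluster setup (or anything else) reads this, it is
	// already final: either the configured value or the default.
	fmt.Println(c.effectiveSDKVersion)
}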

@raskchanky (Contributor) left a comment:

These changes look great, Paul, and make sense to me.

@banks (Member Author) commented Jun 3, 2024

FWIW I manually tested the logging changes (to be pushed shortly) with a custom binary that introduced an artificial 30s delay before heartbeats start, so I could reliably reproduce. It turned out that even the old nodes without the 30s delay still showed the logs in this run!

(screenshots of the log output were attached here in the original PR)

@banks (Member Author) commented Jun 3, 2024

I tried writing a test for this plumbing at least to catch any future regressions. It looked like this:

// TestRaft_Autopilot_BinaryVersionPlumbing is an apparently trivial test that
// ensures that the default plumbing in Vault core to configure the binary
// version of the raft library is working. Hopefully this will trivially pass
// from now on, however it would have caught a regression in the past!
func TestRaft_Autopilot_BinaryVersionPlumbing(t *testing.T) {
	t.Parallel()

	coreCfg, clusterOpts := raftClusterBuilder(t, &RaftClusterOpts{
		EnableAutopilot: true,
		// We need 2 nodes because the code path that regressed was different on a
		// standby vs active node so we'd not detect the problem if we only test on
		// an active node.
		NumCores: 2,
	})

	// Default options should not set EffectiveSDKVersion(Map) which would defeat
	// the point of this test by plumbing versions via config.
	require.Nil(t, clusterOpts.EffectiveSDKVersionMap)
	require.Empty(t, coreCfg.EffectiveSDKVersion)

	c := vault.NewTestCluster(t, coreCfg, &clusterOpts)
	defer c.Cleanup()

	vault.TestWaitActive(t, c.Cores[0].Core)

	client := c.Cores[0].Client

	// Have to wait for the follower to heartbeat at least once or we're just
	// effectively reading the leader's own version back. We know when this has
	// happened because its AppliedIndex will be > 0. Just to be robust against
	// assuming core-0 is the leader, check that all servers have an AppliedIndex
	// > 0.
	corehelpers.RetryUntil(t, 3*time.Minute, func() error {
		state, err := client.Sys().RaftAutopilotState()
		if err != nil {
			return err
		}
		if state == nil {
			return fmt.Errorf("no state")
		}
		wantVersion := version.GetVersion().Version
		for _, s := range state.Servers {
			if s.LastIndex == 0 {
				return fmt.Errorf("server %q has not sent a heartbeat", s.ID)
			}
			if s.Version != wantVersion {
				return fmt.Errorf("want version %s, got %s", wantVersion, s.Version)
			}
		}
		return nil
	})
}

The problem is that this test passes on main without this fix PR, so it's not really very useful. The reason is that the standby nodes are built from the same binary, so the fact that they are sending empty Version strings is masked by the active node "filling in" their strings with its own; since it's the same binary, it will always match and be indistinguishable from the buggy case. 🤔

To get a test case that really reproduces the bug this PR fixes, we'd need the followers to be running a different binary version. We could possibly do that with a docker-based test I guess, but it would be a huge pain without shimming the versions through config... and once you do that you no longer test the actual default code path that caused the regression!

My feeling is that we have a case of diminishing returns in jumping through hoops to get a docker integration test that actually checks out and builds a different version of the binary on some nodes in order to really exercise this. Our manual testing (which we will now codify in a script with a thoroughly considered matrix of scenarios) will need to be enough.

Open to ideas if anyone else can come up with something that is worth adding to CI tests though!

@banks banks marked this pull request as ready for review June 3, 2024 20:27
@banks banks added this to the 1.17.0 milestone Jun 3, 2024

github-actions bot commented Jun 3, 2024

Build Results:
All builds succeeded! ✅

@banks (Member Author) commented Jun 3, 2024

OK, I figured out a way to test this that reliably fails on main. 😅

I feel OK about this since the backend already exposes UpdateVersion publicly, so we can directly test the plumbing on both active and standby nodes without having to infer it from AP state.

This fails on main because core-1, the perf standby in the cluster, will not have had an effectiveSDKVersion set (it's not the active node), and so it will return the empty string.
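
To illustrate the shape of that kind of direct, per-node assertion, here is a standalone sketch; the fakeBackend/fakeCore types and the EffectiveVersion getter are hypothetical stand-ins, not the real Vault test helpers:

// plumbing_test.go
package plumbing

import "testing"

// fakeBackend and fakeCore are simplified stand-ins for the real
// cluster types; only what the per-node assertion needs is modeled.
type fakeBackend struct{ effectiveSDKVersion string }

// EffectiveVersion is a hypothetical getter name used for illustration.
func (b *fakeBackend) EffectiveVersion() string { return b.effectiveSDKVersion }

type fakeCore struct{ backend *fakeBackend }

// TestPlumbing_AllNodes asks every node (active and standby) directly,
// so an unplumbed standby still holding an empty string fails the test
// instead of being masked by the leader filling in its own version.
func TestPlumbing_AllNodes(t *testing.T) {
	wantVersion := "1.17.0"
	cores := []*fakeCore{
		{backend: &fakeBackend{effectiveSDKVersion: wantVersion}}, // active
		{backend: &fakeBackend{effectiveSDKVersion: wantVersion}}, // perf standby
	}
	for i, c := range cores {
		if got := c.backend.EffectiveVersion(); got != wantVersion {
			t.Fatalf("core-%d effective version = %q, want %q", i, got, wantVersion)
		}
	}
}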

c := vault.NewTestCluster(t, coreCfg, &clusterOpts)
defer c.Cleanup()

// Wait for follower to be perf standby
raskchanky (Contributor):

Since this is CE Vault, the follower will never become a perf standby, will it?

banks (Member Author):

No, in CE that helper just waits for the active node. So yeah, it's possible this test would not catch the regression in CE before this fix, or would fail for the wrong reason, but either way I think it passes now and actively catches the issue in Enterprise, so it seems worthwhile keeping it.

banks (Member Author):

I'll update the comment 👍

@raskchanky (Contributor) left a comment:

I left one small comment about a slightly misleading comment, but otherwise looks great!

@banks banks merged commit a04c53e into main Jun 5, 2024
83 checks passed
@banks banks deleted the fix/autopilot-upgrade-ii branch June 5, 2024 17:12
@raskchanky (Contributor):

I think we want to backport this to 1.17 too, right? Though perhaps not merge that backport until after 1.17.0 is released.

raskchanky added a commit that referenced this pull request Jun 5, 2024
* Fix AP upgrade version issue

* add heartbeat logging at trace level

* add log to show when heartbeats resume

* Test the plumbing

* Revert "Test the plumbing"

This reverts commit e25fcd8.

* Add CHANGELOG

* Add plumbing test

* Update misleading comment

---------

Co-authored-by: Josh Black <[email protected]>
raskchanky added another commit that referenced this pull request Jun 5, 2024, with the same commit message as above.
@banks banks added the backport/ent/1.15.x+ent and backport/ent/1.16.x+ent labels and removed the backport/1.15.x label Jun 6, 2024
Labels: backport/ent/1.15.x+ent, backport/ent/1.16.x+ent, backport/1.17.x, hashicorp-contributed-pr