storage controller: add node deletion API #8226
base: main
Conversation
3012 tests run: 2897 passed, 0 failed, 115 skipped (full report)

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: ee09248 at 2024-07-02T09:29:10.926Z :recycle:
Force-pushed from c8e6a4c to ee09248
Have you considered implementing this as a background operation where the caller has to poll for the absence of the node? It would look a lot like the drain code, and I think it would be easier on the operator (i.e. us 😄).
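For concreteness, here is a minimal sketch of that shape, assuming a tokio runtime. `DeletionTracker`, `start_delete`, and `is_deleting` are invented names (not the storage controller's real types), and the actual rescheduling work is stubbed with a sleep:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Illustrative stand-in for a real node identifier type.
type NodeId = u64;

#[derive(Default)]
struct DeletionTracker {
    in_progress: Mutex<HashSet<NodeId>>,
}

impl DeletionTracker {
    // DELETE handler: record intent, spawn the background work, and
    // return to the caller immediately (e.g. 202 Accepted).
    fn start_delete(self: Arc<Self>, node_id: NodeId) {
        self.in_progress.lock().unwrap().insert(node_id);
        tokio::spawn(async move {
            // Real work would reschedule shards and await reconciles;
            // stubbed with a sleep here.
            tokio::time::sleep(Duration::from_secs(1)).await;
            self.in_progress.lock().unwrap().remove(&node_id);
        });
    }

    // GET handler: the operator polls this until the node is gone
    // (surfaced as a 404 in the HTTP layer).
    fn is_deleting(&self, node_id: NodeId) -> bool {
        self.in_progress.lock().unwrap().contains(&node_id)
    }
}
```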
self.maybe_reconcile_shard(shard, nodes);
What's the rationale behind not waiting for the reconciles to complete before deleting the node? An overly eager operator may call into this API on a "very loaded node" ™️ and immediately proceed to nuke it, leading to a period of unavailability for all computes that haven't been informed.
If you take the suggestion above, it would also be nice to limit reconcile concurrency.
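A minimal sketch of what bounded reconcile concurrency could look like, using a tokio `Semaphore`; `ShardId` and `reconcile_shard` are stand-ins, not the storage controller's real names:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-ins for the real shard type and reconcile call.
type ShardId = u32;
async fn reconcile_shard(_shard: ShardId) { /* stand-in */ }

// Run all reconciles with at most `max_in_flight` in flight at once,
// and only return once every one of them has finished.
async fn reconcile_all(shards: Vec<ShardId>, max_in_flight: usize) {
    let sem = Arc::new(Semaphore::new(max_in_flight));
    let mut tasks = Vec::new();
    for shard in shards {
        // Blocks here once max_in_flight reconciles are running.
        let permit = Arc::clone(&sem).acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            reconcile_shard(shard).await;
            drop(permit); // free a concurrency slot
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}
```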
# 1. Mark pageserver scheduling=pause
# 2. Mark pageserver availability=offline to trigger migrations away from it
Isn't this step racy? The node will still reply to heartbeats and be considered active again.
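A toy illustration of the race being described, with invented types; the mitigation mentioned in the trailing comment is an assumption, not something this PR implements:

```rust
// If the heartbeat handler unconditionally flips a responsive node
// back to Active, an operator-set Offline marking can be undone
// before the migrations it was meant to trigger have happened.
#[derive(Debug, PartialEq)]
enum Availability { Active, Offline }

#[derive(Debug, PartialEq)]
enum Scheduling { Active, Pause }

struct Node {
    availability: Availability,
    scheduling: Scheduling,
}

fn on_heartbeat_ok(node: &mut Node) {
    // Racy version: ignores any operator intent.
    node.availability = Availability::Active;

    // One possible mitigation (an assumption, not the PR's behavior):
    // only resurrect nodes whose scheduling policy is still Active,
    //   if node.scheduling == Scheduling::Active { ... }
}
```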
Problem
In anticipation of later adding a really nice drain+delete API, I initially only added an intentionally basic /drop API that is just about usable for deleting nodes in a pinch, but requires some ugly storage controller restarts to persuade it to restart secondaries.

Summary of changes
I started making a few tiny fixes, and ended up writing the delete API...
- Handle generation_pageserver columns that point to nonexistent node IDs. I started out thinking of this as a general resilience thing, but when implementing the delete API I realized it was actually a legitimate end state after the delete API is called (as that API doesn't wait for all reconciles to succeed).
- Add a DELETE API for nodes, which does not gracefully drain, but does reschedule everything (sketched below). This becomes safe to use when the system is in any state, but will incur availability gaps for any tenants that weren't already live-migrated away. If tenants have already been drained, this becomes a totally clean + safe way to decommission a node.

FIXME: the node deletion function suffers the same awkwardness as other functions that iterate through shards and call schedule(): it doesn't have a proper ScheduleContext for them all. That doesn't break anything, it just means that some shards may later get migrated again in the background.
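A condensed sketch of the delete flow described above, using made-up miniature types rather than the PR's actual code:

```rust
use std::collections::HashMap;

// Made-up miniature of the controller state; the real code is richer.
type NodeId = u64;
struct Shard { attached: Option<NodeId> }
struct ControllerState {
    nodes: HashMap<NodeId, ()>, // node metadata elided
    shards: HashMap<u32, Shard>,
}

fn delete_node(state: &mut ControllerState, node_id: NodeId) {
    // Drop the node first so the scheduler can no longer pick it.
    state.nodes.remove(&node_id);
    for shard in state.shards.values_mut() {
        if shard.attached == Some(node_id) {
            // Reschedule onto some surviving node. Per the FIXME, the
            // real loop lacks a shared ScheduleContext, so placement
            // may be revisited later in the background.
            shard.attached = state.nodes.keys().next().copied();
            // The PR then fires maybe_reconcile_shard() without
            // awaiting it, rather than blocking the API call.
        }
    }
}
```

Because the reconciles are not awaited, the API returns quickly but cannot guarantee that computes have been told about their new locations; that is the availability-gap caveat above.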
Checklist before requesting a review
Checklist before merging