Add job for graceful node shutdown feature #32728

torredil · 2024-06-10T12:08:42Z

This PR introduces a new job to validate the graceful node shutdown feature:

The job is configured to run only tests labeled [Feature:GracefulNodeShutdown].
The job uses the SHUTDOWN_GRACE_PERIOD and SHUTDOWN_GRACE_PERIOD_CRITICAL_PODS environment variables to enable the graceful shutdown feature; see "Conditionally add the graceful shutdown Kubelet parameters" kubernetes#125413 (comment).
The job is optional and is not configured to always run.
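
As an illustration of the environment-variable gating described above, here is a minimal Python sketch. It is not the code from kubernetes#125413; the function name, the assumption that each variable is handled independently, and the sample durations are all hypothetical. The only names taken from this PR are the two environment variables, and `shutdownGracePeriod` / `shutdownGracePeriodCriticalPods` are the corresponding KubeletConfiguration fields.

```python
import os


def graceful_shutdown_config(env=None):
    """Return Kubelet config fields for graceful shutdown, if requested.

    Hypothetical helper: a field is added only when its environment
    variable is set to a non-empty value, so clusters that do not set
    the variables keep their existing Kubelet configuration.
    """
    env = os.environ if env is None else env
    config = {}

    period = env.get("SHUTDOWN_GRACE_PERIOD", "")
    if period:
        config["shutdownGracePeriod"] = period

    critical = env.get("SHUTDOWN_GRACE_PERIOD_CRITICAL_PODS", "")
    if critical:
        config["shutdownGracePeriodCriticalPods"] = critical

    return config


# Sample values are assumptions, not the values used by the job.
enabled = graceful_shutdown_config({
    "SHUTDOWN_GRACE_PERIOD": "30s",
    "SHUTDOWN_GRACE_PERIOD_CRITICAL_PODS": "10s",
})
disabled = graceful_shutdown_config({})
```

With neither variable set the helper returns an empty mapping, which mirrors the intent that graceful shutdown stays off unless a job opts in.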

k8s-ci-robot · 2024-06-10T12:08:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: torredil
Once this PR has been reviewed and has the lgtm label, please assign cheftako for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

config/jobs/kubernetes/sig-cloud-provider/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

torredil · 2024-06-10T18:55:11Z

The regex change made in pull-kubernetes-e2e-gce ensures that the e2e test implemented in kubernetes/kubernetes#125070 is skipped, because graceful node shutdown is not enabled for that job.
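
To show how a focus or skip regex decides which tests a job runs, here is a minimal Python sketch. The test names and both regexes are assumptions for illustration; they are not the values used in gcp-gce.yaml or in pull-kubernetes-e2e-gce. Only the [Feature:GracefulNodeShutdown] label comes from this PR. The selection rule modeled here is the usual one: a test runs if it matches the focus pattern (when given) and does not match the skip pattern (when given).

```python
import re

# Hypothetical test names, used only to exercise the patterns.
TESTS = [
    "[sig-storage] [Feature:GracefulNodeShutdown] stateful pods recover after shutdown",
    "[sig-network] Services should serve a basic endpoint",
    "[sig-node] Pods should be submitted and removed [Slow]",
]


def select(tests, focus=None, skip=None):
    """Return the tests that match `focus` and do not match `skip`."""
    focus_re = re.compile(focus) if focus else None
    skip_re = re.compile(skip) if skip else None
    chosen = []
    for name in tests:
        if focus_re and not focus_re.search(name):
            continue  # not selected by the focus pattern
        if skip_re and skip_re.search(name):
            continue  # explicitly excluded by the skip pattern
        chosen.append(name)
    return chosen


# A general presubmit that skips every [Feature:...] test (assumed pattern).
presubmit = select(TESTS, skip=r"\[Slow\]|\[Feature:.+\]")

# A dedicated job that focuses on the graceful shutdown label only.
dedicated = select(TESTS, focus=r"\[Feature:GracefulNodeShutdown\]")
```

In this sketch the labeled test is excluded from the general job and is the only test selected by the dedicated one, which is the split the two jobs are meant to achieve.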

dims · 2024-06-11T17:03:35Z

/assign @mrunalp

pacoxu · 2024-06-17T09:38:32Z

/cc @wzshiming @bobbypage

aojea · 2024-06-17T18:14:06Z

/hold

The direction of these changes is not clear to me: the feature does not seem to have e2e tests (only e2e_node), and this PR causes some existing tests to be skipped.

torredil · 2024-06-17T20:13:19Z

@aojea

> the feature does not seem to have e2e tests (only e2e_node)

That is correct - today, NodeFeature:GracefulNodeShutdown does not have any e2e tests, only e2e_node tests. The purpose of this PR is to enable writing full cluster e2e tests for that feature - there are no jobs which currently create a test cluster with the Kubelet graceful shutdown feature enabled.

See this PR for more context: kubernetes/kubernetes#125070. I synced up with @msau42 offline and we agreed to implement the e2e test written in that PR under e2e. That test assumes that graceful shutdown is enabled on the node. It cannot be part of e2e_node because it relies on setting up CSI drivers, provisioning volumes, and so on, which is not possible in the limited test environment created by e2e_node.

bobbypage · 2024-06-19T03:55:58Z

> That is correct - today, NodeFeature:GracefulNodeShutdown does not have any e2e tests, only e2e_node tests. The purpose of this PR is to enable writing full cluster e2e tests for that feature - there are no jobs which currently create a test cluster with the Kubelet graceful shutdown feature enabled.

Sorry, I might be missing some context here, but what is the purpose of having an e2e test for this versus relying on the existing node e2e tests? All of the logic for graceful shutdown is node specific, so I'm trying to understand what additional signal a cluster e2e test would provide.

torredil · 2024-06-19T15:54:22Z

@bobbypage, the node specific logic is already covered by e2e_node tests as you mention, but we need to validate how the shutdown process interacts with other cluster components like the A/D controller and CSI drivers.

For example, users often observe 6-minute delays before stateful pods enter a running state after a node is gracefully terminated, because they must wait for the A/D controller to issue a force detach when volumes were not unmounted in time.

^ kubernetes/kubernetes#125070 addresses that delay by taking volume status into account before proceeding with termination and it includes an e2e test to validate that stateful pods enter a running state in a timely manner (as expected when nodes are gracefully terminated). My understanding is that this level of validation is not possible with e2e_node tests alone.

torredil · 2024-06-24T13:13:21Z

cc: @bobbypage for review.

k8s-ci-robot requested review from andrewsykim and cheftako June 10, 2024 12:08

torredil force-pushed the master branch 3 times, most recently from 5362a44 to d27b4c7, June 10, 2024 12:39

This was referenced Jun 10, 2024

Ensure volumes are unmounted during graceful node shutdown kubernetes/kubernetes#125070 (Open)

Conditionally add the graceful shutdown Kubelet flags kubernetes/cloud-provider-gcp#721 (Closed)

torredil force-pushed the master branch 2 times, most recently from 7459705 to dfb8330, June 10, 2024 18:39

k8s-ci-robot assigned mrunalp Jun 11, 2024

k8s-ci-robot requested review from bobbypage and wzshiming June 17, 2024 09:38

aojea reviewed Jun 17, 2024

config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml (outdated)

aojea reviewed Jun 17, 2024

config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml

k8s-ci-robot added the do-not-merge/hold label Jun 17, 2024

torredil force-pushed the master branch 2 times, most recently from e67751e to 8ebdda9, June 17, 2024 22:21

torredil mentioned this pull request Jun 17, 2024

Conditionally add the graceful shutdown Kubelet parameters kubernetes/kubernetes#125413 (Merged)

aojea reviewed Jun 18, 2024

config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml (outdated)

Add job for graceful node shutdown feature

ea80eb7

Signed-off-by: torredil <[email protected]>

torredil force-pushed the master branch from 8ebdda9 to ea80eb7, June 18, 2024 15:27
