Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add job for graceful node shutdown feature #32728

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

torredil
Copy link
Member

@torredil torredil commented Jun 10, 2024

This PR introduces a new job to validate the graceful node shutdown feature:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/config Issues or PRs related to code in /config size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/jobs area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 10, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: torredil
Once this PR has been reviewed and has the lgtm label, please assign cheftako for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@torredil
Copy link
Member Author

The regex change made in pull-kubernetes-e2e-gce ensures that the e2e test implemented in PR 125070 is skipped (graceful node shutdown is not enabled for that job)

@dims
Copy link
Member

dims commented Jun 11, 2024

/assign @mrunalp

@pacoxu
Copy link
Member

pacoxu commented Jun 17, 2024

/cc @wzshiming @bobbypage

@aojea
Copy link
Member

aojea commented Jun 17, 2024

/hold

is not clear to me the direction of these changes, the feature does not seem to have e2e tests (only e2e_node) and makes some existing tests to be skipped

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@torredil
Copy link
Member Author

@aojea

the feature does not seem to have e2e tests (only e2e_node)

That is correct - today, NodeFeature:GracefulNodeShutdown does not have any e2e tests, only e2e_node tests. The purpose of this PR is to enable writing full cluster e2e tests for that feature - there are no jobs which currently create a test cluster with the Kubelet graceful shutdown feature enabled.

see this PR for more context: kubernetes/kubernetes#125070. I synced up with @msau42 offline and we agreed to implement the e2e test written in that PR under e2e. That test assumes that graceful shutdown is enabled on the node. It cannot be part of e2e_node because it relies on setting up CSI drivers, provisioning volumes, and so on, which is not possible to do in the limited test environment created by e2e_node.

@torredil torredil force-pushed the master branch 2 times, most recently from e67751e to 8ebdda9 Compare June 17, 2024 22:21
@bobbypage
Copy link
Member

bobbypage commented Jun 19, 2024

That is correct - today, NodeFeature:GracefulNodeShutdown does not have any e2e tests, only e2e_node tests. The purpose of this PR is to enable writing full cluster e2e tests for that feature - there are no jobs which currently create a test cluster with the Kubelet graceful shutdown feature enabled.

Sorry, I might be missing some context here but what is the purpose of having an e2e test for this vs relying on the existing node e2e tests? All of the logic for graceful shutdown is node specific so I'm trying to understand what additional signal a cluster e2e would provide.

@torredil
Copy link
Member Author

@bobbypage, the node specific logic is already covered by e2e_node tests as you mention, but we need to validate how the shutdown process interacts with other cluster components like the A/D controller and CSI drivers.

For example, users often observe 6 minute delays for stateful pods to enter a running state after a node is gracefully terminated, due to having to wait for the A/D controller to issue a force detach if volumes were not unmounted in time.

^ kubernetes/kubernetes#125070 addresses that delay by taking volume status into account before proceeding with termination and it includes an e2e test to validate that stateful pods enter a running state in a timely manner (as expected when nodes are gracefully terminated). My understanding is that this level of validation is not possible with e2e_node tests alone.

@torredil
Copy link
Member Author

cc: @bobbypage for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/config Issues or PRs related to code in /config area/jobs area/provider/gcp Issues or PRs related to gcp provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants