Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up cluster-administration/node-shutdown.md #49146

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 59 additions & 50 deletions content/en/docs/concepts/cluster-administration/node-shutdown.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,25 +5,27 @@ weight: 10
---

<!-- overview -->

In a Kubernetes cluster, a {{< glossary_tooltip text="node" term_id="node" >}}
can be shutdown in a planned graceful way or unexpectedly because of reasons such
can be shut down in a planned graceful way or unexpectedly because of reasons such
as a power outage or something else external. A node shutdown could lead to workload
failure if the node is not drained before the shutdown. A node shutdown can be
either **graceful** or **non-graceful**.

<!-- body -->

## Graceful node shutdown {#graceful-node-shutdown}

{{< feature-state feature_gate_name="GracefulNodeShutdown" >}}

The kubelet attempts to detect node system shutdown and terminates pods running on the node.

Kubelet ensures that pods follow the normal
kubelet ensures that pods follow the normal
[pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
during the node shutdown. During node shutdown, the kubelet does not accept new
Pods (even if those Pods are already bound to the node).

The Graceful node shutdown feature depends on systemd since it takes advantage of
The graceful node shutdown feature depends on systemd since it takes advantage of
[systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
delay the node shutdown with a given duration.

Expand All @@ -32,12 +34,12 @@ Graceful node shutdown is controlled with the `GracefulNodeShutdown`
enabled by default in 1.21.

Note that by default, both configuration options described below,
`shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` are set to zero,
`shutdownGracePeriod` and `shutdownGracePeriodCriticalPods`, are set to zero,
thus not activating the graceful node shutdown functionality.
To activate the feature, the two kubelet config settings should be configured appropriately and
To activate the feature, both options should be configured appropriately and
set to non-zero values.

Once systemd detects or notifies node shutdown, the kubelet sets a `NotReady` condition on
Once systemd detects or is notified of a node shutdown, the kubelet sets a `NotReady` condition on
the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition
and does not schedule any Pods onto the affected node; other third-party schedulers are
expected to follow the same logic. This means that new Pods won't be scheduled onto that node
Expand All @@ -48,26 +50,29 @@ node shutdown has been detected, so that even Pods with a
{{< glossary_tooltip text="toleration" term_id="toleration" >}} for
`node.kubernetes.io/not-ready:NoSchedule` do not start there.

At the same time when kubelet is setting that condition on its Node via the API,
When kubelet is setting that condition on its Node via the API,
the kubelet also begins terminating any Pods that are running locally.

During a graceful shutdown, kubelet terminates pods in two phases:

1. Terminate regular pods running on the node.
2. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
1. Terminate [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
running on the node.

Graceful node shutdown feature is configured with two
The graceful node shutdown feature is configured with two
[`KubeletConfiguration`](/docs/tasks/administer-cluster/kubelet-config-file/) options:

* `shutdownGracePeriod`:
* Specifies the total duration that the node should delay the shutdown by. This is the total
grace period for pod termination for both regular and
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).
* `shutdownGracePeriodCriticalPods`:
* Specifies the duration used to terminate
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
during a node shutdown. This value should be less than `shutdownGracePeriod`.
- `shutdownGracePeriod`:

Specifies the total duration that the node should delay the shutdown by. This is the total
grace period for pod termination for both regular and
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).

- `shutdownGracePeriodCriticalPods`:

Specifies the duration used to terminate
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
during a node shutdown. This value should be less than `shutdownGracePeriod`.

{{< note >}}

Expand Down Expand Up @@ -122,22 +127,22 @@ Assuming the following custom pod
[priority classes](/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass)
in a cluster,

|Pod priority class name|Pod priority class value|
|-------------------------|------------------------|
|`custom-class-a` | 100000 |
|`custom-class-b` | 10000 |
|`custom-class-c` | 1000 |
|`regular/unset` | 0 |
| Pod priority class name | Pod priority class value |
| ----------------------- | ------------------------ |
| `custom-class-a` | 100000 |
| `custom-class-b` | 10000 |
| `custom-class-c` | 1000 |
| `regular/unset` | 0 |

Within the [kubelet configuration](/docs/reference/config-api/kubelet-config.v1beta1/)
the settings for `shutdownGracePeriodByPodPriority` could look like:

|Pod priority class value|Shutdown period|
|------------------------|---------------|
| 100000 |10 seconds |
| 10000 |180 seconds |
| 1000 |120 seconds |
| 0 |60 seconds |
| Pod priority class value | Shutdown period |
| ------------------------ | --------------- |
| 100000 | 10 seconds |
| 10000 | 180 seconds |
| 1000 | 120 seconds |
| 0 | 60 seconds |

The corresponding kubelet config YAML configuration would be:

Expand All @@ -154,18 +159,18 @@ shutdownGracePeriodByPodPriority:
```

The above table implies that any pod with `priority` value >= 100000 will get
just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180
seconds to stop, any pod with value >= 1000 and < 10000 will get 120 seconds to stop.
Finally, all other pods will get 60 seconds to stop.
just 10 seconds to shut down, any pod with value >= 10000 and < 100000 will get 180
Copy link
Member Author

@windsonsea windsonsea Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep consistent with the context code though it has similar meaning between stop and shut down:

shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10

seconds to shut down, any pod with value >= 1000 and < 10000 will get 120 seconds to shut down.
Finally, all other pods will get 60 seconds to shut down.

One doesn't have to specify values corresponding to all of the classes. For
example, you could instead use these settings:

|Pod priority class value|Shutdown period|
|------------------------|---------------|
| 100000 |300 seconds |
| 1000 |120 seconds |
| 0 |60 seconds |
| Pod priority class value | Shutdown period |
| ------------------------ | --------------- |
| 100000 | 300 seconds |
| 1000 | 120 seconds |
| 0 | 60 seconds |

In the above case, the pods with `custom-class-b` will go into the same bucket
as `custom-class-c` for shutdown.
Expand Down Expand Up @@ -225,14 +230,16 @@ on a different node.
During a non-graceful shutdown, Pods are terminated in the two phases:

1. Force delete the Pods that do not have matching `out-of-service` tolerations.
2. Immediately perform detach volume operation for such pods.
1. Immediately perform detach volume operation for such pods.

{{< note >}}

- Before adding the taint `node.kubernetes.io/out-of-service`, it should be verified
that the node is already in shutdown or power off state (not in the middle of restarting).
- The user is required to manually remove the out-of-service taint after the pods are
moved to a new node and the user has checked that the shutdown node has been
recovered since the user was the one who originally added the taint.

{{< /note >}}

### Forced storage detach on timeout {#storage-force-detach-on-timeout}
Expand All @@ -256,39 +263,41 @@ its associated
[VolumeAttachment](/docs/reference/kubernetes-api/config-and-storage-resources/volume-attachment-v1/)
deleted.

After this setting has been applied, unhealthy pods still attached to a volumes must be recovered
After this setting has been applied, unhealthy pods still attached to volumes must be recovered
via the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure mentioned above.

{{< note >}}

- Caution must be taken while using the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure.
- Deviation from the steps documented above can result in data corruption.
{{< /note >}}

{{< /note >}}

## Windows Graceful node shutdown {#windows-graceful-node-shutdown}

{{< feature-state feature_gate_name="WindowsGracefulNodeShutdown" >}}

The Windows graceful node shutdown feature depends on kubelet running as a Windows service,
it will then have a registered [service control handler](https://learn.microsoft.com/en-us/windows/win32/services/service-control-handler-function)
to delay the presshutdown event with a given duration.
The Windows graceful node shutdown feature depends on kubelet running as a Windows service,
it will then have a registered [service control handler](https://learn.microsoft.com/en-us/windows/win32/services/service-control-handler-function)
to delay the preshutdown event with a given duration.

Windows graceful node shutdown is controlled with the `WindowsGracefulNodeShutdown`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
Windows graceful node shutdown is controlled with the `WindowsGracefulNodeShutdown`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
which is introduced in 1.32 as an alpha feature.

Windows graceful node shutdown can not be cancelled.

If Kubelet is not running as a Windows service, it will not be able to set and monitor
If kubelet is not running as a Windows service, it will not be able to set and monitor
the [Preshutdown](https://learn.microsoft.com/en-us/windows/win32/api/winsvc/ns-winsvc-service_preshutdown_info) event,
the node will have to go through the [Non-Graceful Node Shutdown](#non-graceful-node-shutdown) procedure mentioned above.

In the case where the Windows graceful node shutdown feature is enabled, but the kubelet is not
running as a Windows service, the kubelet will continue running instead of failing. However,
In the case where the Windows graceful node shutdown feature is enabled, but the kubelet is not
running as a Windows service, the kubelet will continue running instead of failing. However,
it will log an error indicating that it needs to be run as a Windows service.

## {{% heading "whatsnext" %}}

Learn more about the following:
* Blog: [Non-Graceful Node Shutdown](/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/).
* Cluster Architecture: [Nodes](/docs/concepts/architecture/nodes/).

- Blog: [Non-Graceful Node Shutdown](/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/).
- Cluster Architecture: [Nodes](/docs/concepts/architecture/nodes/).