Skip to content

Commit

Permalink
add self-healing flavor and docs on machine health checks (#256)
Browse files Browse the repository at this point in the history
  • Loading branch information
AshleyDumaine authored Apr 17, 2024
1 parent 4210732 commit f9b8308
Show file tree
Hide file tree
Showing 7 changed files with 100 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
- [Disks](./topics/disks/disks.md)
- [OS Disk](./topics/disks/os-disk.md)
- [Data Disks](./topics/disks/data-disks.md)
- [Machine Health Checks](./topics/health-checking.md)
- [Development](./developers/development.md)
- [Releasing](./developers/releasing.md)
- [Reference](./reference/reference.md)
2 changes: 1 addition & 1 deletion docs/src/topics/backups.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ clusterctl generate cluster $CLUSTER_NAME \
--flavor etcd-backup-restore \
| kubectl apply -f -
```
For more fine-grain control and to know more about etcd backups, refere [backups.md](../topics/etcd.md)
For more fine-grain control and to know more about etcd backups, refer to [the backups section of the etcd page](../topics/etcd.md#etcd-backups)

## Object Storage

Expand Down
5 changes: 3 additions & 2 deletions docs/src/topics/etcd.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,9 @@ Users can also enable SSE (Server-side encryption) by passing a SSE AES-256 Key
[here](https://github.com/linode/cluster-api-provider-linode/blob/main/templates/addons/etcd-backup-restore/etcd-backup-restore.yaml)
on the pod can be controlled during the provisioning process.

> [!WARNING]
> This is currently under development and will be available for use once the upstream [PR](https://github.com/gardener/etcd-backup-restore/pull/719) is merged and an official image is made available
```admonish warning
This is currently under development and will be available for use once the upstream [PR](https://github.com/gardener/etcd-backup-restore/pull/719) is merged and an official image is made available
```

For eg:
```sh
Expand Down
25 changes: 25 additions & 0 deletions docs/src/topics/health-checking.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Machine Health Checks

CAPL supports auto-remediation of workload cluster Nodes considered to be unhealthy
via [`MachineHealthChecks`](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking).

## Enabling Machine Health Checks

While it is possible to manually create and apply a `MachineHealthCheck` resource into the management cluster,
using the `self-healing` flavor is the quickest way to get started:
```sh
clusterctl generate cluster $CLUSTER_NAME \
--kubernetes-version v1.29.1 \
--infrastructure linode:0.0.0 \
--flavor self-healing \
| kubectl apply -f -
```

This flavor deploys a `MachineHealthCheck` for the workers and another `MachineHealthCheck` for the control plane
of the cluster. It also configures the remediation strategy of the kubeadm control plane to prevent unnecessary load
on the infrastructure provider.

## Configuring Machine Health Checks

Refer to the [Cluster API documentation](https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking)
for further information on configuring and using `MachineHealthChecks`.
4 changes: 4 additions & 0 deletions templates/addons/machine-health-check/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- machinehealthcheck.yaml
46 changes: 46 additions & 0 deletions templates/addons/machine-health-check/machinehealthcheck.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
name: ${CLUSTER_NAME}-node-unhealthy-5m
spec:
clusterName: ${CLUSTER_NAME}
# (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy
maxUnhealthy: 40%
# (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
# a Node to join the cluster, before considering a Machine unhealthy.
# Defaults to 10 minutes if not specified.
# Set to 0 to disable the node startup timeout.
# Disabling this timeout will prevent a Machine from being considered unhealthy when
# the Node it created has not yet registered with the cluster. This can be useful when
# Nodes take a long time to start up or when you only want condition based checks for
# Machine health.
nodeStartupTimeout: 10m
# Conditions to check on Nodes for matched Machines, if any condition is matched for the duration of its timeout, the Machine is considered unhealthy
selector:
matchLabels:
cluster.x-k8s.io/deployment-name: ${CLUSTER_NAME}-md-0
unhealthyConditions:
- type: Ready
status: Unknown
timeout: 300s
- type: Ready
status: "False"
timeout: 300s
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
name: ${CLUSTER_NAME}-kcp-unhealthy-5m
spec:
clusterName: ${CLUSTER_NAME}
maxUnhealthy: 100%
selector:
matchLabels:
cluster.x-k8s.io/control-plane: ""
unhealthyConditions:
- type: Ready
status: Unknown
timeout: 300s
- type: Ready
status: "False"
timeout: 300s
20 changes: 20 additions & 0 deletions templates/flavors/self-healing/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../default
- ../../addons/machine-health-check
patches:
- target:
group: controlplane.cluster.x-k8s.io
version: v1beta1
kind: KubeadmControlPlane
patch: |-
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
name: ${CLUSTER_NAME}-control-plane
spec:
remediationStrategy:
maxRetry: 5
retryPeriod: 2m
minHealthyPeriod: 2h

0 comments on commit f9b8308

Please sign in to comment.