Update error-code-poddrainfailure.md #1742

@@ -3,7 +3,7 @@ metadata:
title: Azure Kubernetes Service (AKS) common issues FAQ
description: Review a list of frequently asked questions (FAQ) about common issues when you're working with an Azure Kubernetes Service (AKS) cluster.
ms.topic: faq
ms.date: 11/14/2023
ms.date: 12/17/2024
ms.reviewer: chiragpa, nickoman, v-leedennis
ms.service: azure-kubernetes-service
ms.custom: sap:Create, Upgrade, Scale and Delete operations (cluster or nodepool)
@@ -26,7 +26,7 @@ sections:
- question: |
Can I move my cluster to a different subscription, or move my subscription with my cluster to a new tenant?
answer: |
If you've moved your AKS cluster to a different subscription or the cluster's subscription to a new tenant, the cluster won't function because of missing cluster identity permissions. AKS doesn't support moving clusters across subscriptions or tenants because of this constraint.
No. If you've moved your AKS cluster to a different subscription or the cluster's subscription to a new tenant, the cluster won't function because of missing cluster identity permissions. AKS doesn't support moving clusters across subscriptions or tenants because of this constraint. For more information, see the [Operations FAQ](https://learn.microsoft.com/en-us/azure/aks/faq#operations) documentation.

- question: |
What naming restrictions are enforced for AKS resources and parameters?
@@ -42,6 +42,9 @@ sections:
- AKS node pool names must be all lowercase. The names must be 1-12 characters in length for Linux node pools and 1-6 characters for Windows node pools. A name must start with a letter, and the only allowed characters are letters and numbers.

- The *admin-username*, which sets the administrator user name for Linux nodes, must start with a letter. This user name may only contain letters, numbers, hyphens, and underscores. It has a maximum length of 32 characters.

Further details about naming conventions are available in the following articles:

- [Resource naming rules for Microsoft.ContainerService](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules#microsoftcontainerservice)
- [Recommended resource abbreviations for containers](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/resource-abbreviations#containers)
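
For example, the following hypothetical Azure CLI commands use names that satisfy these rules (the resource group, cluster, and node pool names are illustrative only):

```output
# Linux node pool: 1-12 characters, lowercase letters and numbers, starting with a letter.
$ az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster --name userpool01 --node-count 1

# Windows node pool: 1-6 characters, lowercase letters and numbers, starting with a letter.
$ az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster --name npwin1 --os-type Windows --node-count 1
```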

additionalContent: |
[!INCLUDE [Third-party disclaimer](../../../includes/third-party-disclaimer.md)]
@@ -1,7 +1,7 @@
---
title: Troubleshoot UpgradeFailed errors due to eviction failures caused by PDBs
description: Learn how to troubleshoot UpgradeFailed errors due to eviction failures caused by Pod Disruption Budgets when you try to upgrade an Azure Kubernetes Service cluster.
ms.date: 12/21/2023
ms.date: 12/13/2024
editor: v-jsitser
ms.reviewer: chiragpa, v-leedennis, v-weizhu
ms.service: azure-kubernetes-service
@@ -15,24 +15,55 @@ This article discusses how to identify and resolve UpgradeFailed errors due to e

## Prerequisites

This article requires Azure CLI version 2.0.65 or a later version. To find the version number, run `az --version`. If you have to install or upgrade Azure CLI, see [How to install the Azure CLI](/cli/azure/install-azure-cli).
This article requires Azure CLI version 2.67.0 or a later version. To find the version number, run `az --version`. If you have to install or upgrade Azure CLI, see [How to install the Azure CLI](/cli/azure/install-azure-cli).
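
To confirm the installed version, the check might resemble the following (output abbreviated; the version number is only an example):

```output
$ az --version
azure-cli                         2.67.0
...
```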

For more detailed information about the upgrade process, see the "Upgrade an AKS cluster" section in [Upgrade an Azure Kubernetes Service (AKS) cluster](/azure/aks/upgrade-cluster#upgrade-an-aks-cluster).

## Symptoms

An AKS cluster upgrade operation fails with the following error message:
An AKS cluster upgrade operation fails with one of the following error messages:

> Code: UpgradeFailed
> Message: Drain node \<node-name> failed when evicting pod \<pod-name>. Eviction failed with Too many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See `http://aka.ms/aks/debugdrainfailures`. Original error: API call to Kubernetes API Server failed.
> (UpgradeFailed) Drain node `aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx` failed when evicting pod `<pod-name>` failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: `<namespace>/<pod-name>` blocked by pdb `<pdb-name>` with 0 unready pods.

> Code: UpgradeFailed
> Message: Drain node `aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx` failed when evicting pod `<pod-name>` failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: `<namespace>/<pod-name>` blocked by pdb `<pdb-name>` with 0 unready pods.

## Cause

This error might occur if a pod is protected by the Pod Disruption Budget (PDB) policy. In this situation, the pod resists being drained.
This error might occur if a pod is protected by a Pod Disruption Budget (PDB) policy. In this situation, the pod resists being drained, the upgrade operation fails after several attempts, and the cluster or node pool falls into a `Failed` state.

Check the **ALLOWED DISRUPTIONS** value in the PDB configuration. The value should be **1** or greater. For more information, see [Plan for availability using pod disruption budgets](/azure/aks/operator-best-practices-scheduler#plan-for-availability-using-pod-disruption-budgets). For example, you can check the workload and its PDB as follows. If the **ALLOWED DISRUPTIONS** column shows **0**, the pods can't be evicted, and the node drain fails during the upgrade process:

```output
$ kubectl get deployments.apps nginx
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 2/2 2 2 62s

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-7854ff8877-gbr4m 1/1 Running 0 68s
nginx-7854ff8877-gnltd 1/1 Running 0 68s

$ kubectl get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nginx-pdb 2 N/A 0 24s

```
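
For context, a restrictive PDB like the hypothetical `nginx-pdb` shown above could have been created as follows. With `--min-available=2` and only two running replicas, no voluntary disruption is allowed:

```output
# Hypothetical example only: requiring both of the two nginx replicas to stay available
# leaves ALLOWED DISRUPTIONS at 0, which blocks node drain during an upgrade.
$ kubectl create poddisruptionbudget nginx-pdb --selector=app=nginx --min-available=2
poddisruptionbudget.policy/nginx-pdb created
```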

To test this situation, run `kubectl get pdb -A`, and then check the **Allowed Disruption** value. The value should be **1** or greater. For more information, see [Plan for availability using pod disruption budgets](/azure/aks/operator-best-practices-scheduler#plan-for-availability-using-pod-disruption-budgets).
You can also check for related entries in the Kubernetes events by running `kubectl get events | grep -i drain`. Output similar to the following shows the message `Eviction blocked by Too Many Requests (usually a pdb)`:

```output
$ kubectl get events | grep -i drain
LAST SEEN TYPE REASON OBJECT MESSAGE
(...)
32m Normal Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Draining node: aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx
2m57s Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
12m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
31m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
```

If the **Allowed Disruption** value is **0**, the node drain will fail during the upgrade process.

To resolve this issue, use one of the following solutions.

@@ -41,18 +72,41 @@ To resolve this issue, use one of the following solutions.
1. Adjust the PDB to enable pod draining. Generally, the allowed disruptions value is controlled by the `Min Available / Max unavailable` or `Running pods / Replicas` parameter. You can modify the `Min Available / Max unavailable` parameter at the PDB level, or increase the number of `Running pods / Replicas`, to push the **ALLOWED DISRUPTIONS** value to **1** or greater (see the sketch after the following output block).
2. Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process will trigger a reconciliation.

```output
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
```
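
For step 1, a minimal sketch of the two adjustment options, based on the hypothetical `nginx` deployment and `nginx-pdb` example from the Cause section (names, namespace, and values are illustrative):

```output
# Option A: relax the PDB so that one pod may be voluntarily disrupted.
$ kubectl patch pdb nginx-pdb -n default --type merge -p '{"spec":{"minAvailable":1}}'
poddisruptionbudget.policy/nginx-pdb patched

# Option B: add a replica so that minAvailable=2 can still be met while a pod is evicted.
$ kubectl scale deployment nginx -n default --replicas=3
deployment.apps/nginx scaled
```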

## Solution 2: Back up, delete, and redeploy the PDB

1. Take a backup of the PDB `kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb_backup.yaml`, and then delete the PDB `kubectl delete pdb <pdb-name> -n /<pdb-namespace>`. After the upgrade is finished, you can redeploy the PDB `kubectl apply -f pdb_backup.yaml`.
1. Take a backup of the PDB by running `kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb-name-backup.yaml`, and then delete the PDB by running `kubectl delete pdb <pdb-name> -n <pdb-namespace>`. After the new upgrade attempt is finished, you can redeploy the PDB by applying the backup file: `kubectl apply -f pdb-name-backup.yaml`. A sketch of this sequence follows the output block below.
2. Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process will trigger a reconciliation.

## Solution 3: Delete the pods that can't be drained
```output
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
```
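
For step 1, the backup, delete, and later redeploy sequence might look like the following. The PDB name and namespace are placeholders:

```output
# Back up the PDB manifest, then remove the PDB so that node drain can proceed.
$ kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb-name-backup.yaml
$ kubectl delete pdb <pdb-name> -n <pdb-namespace>
poddisruptionbudget.policy "<pdb-name>" deleted

# After the upgrade attempt finishes, redeploy the PDB from the backup file.
# If the apply is rejected because of server-generated fields (such as resourceVersion),
# remove those fields from the backup file first.
$ kubectl apply -f pdb-name-backup.yaml
```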

## Solution 3: Delete the pods that can't be drained or scale the workload down to zero (0)

1. Delete the pods that can't be drained.

> [!NOTE]
> If the pods were created by a deployment or StatefulSet, they'll be controlled by a ReplicaSet. If that's the case, you might have to delete the deployment or StatefulSet. Before you do that, we recommend that you make a backup: `kubectl get <kubernetes-object> <name> -n <namespace> -o yaml > backup.yaml`.
> If the pods were created by a Deployment, they're controlled by a ReplicaSet; if they were created by a StatefulSet, they're controlled directly by the StatefulSet. In either case, you might have to delete the Deployment or StatefulSet, or scale its replicas to zero (0). Before you do that, we recommend that you make a backup: `kubectl get <deployment.apps -or- statefulset.apps> <name> -n <namespace> -o yaml > backup.yaml`.

2. Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process will trigger a reconciliation.
2. To scale down, you can run `kubectl scale --replicas=0 <deployment.apps -or- statefulset.apps> <name> -n <namespace>` before the reconciliation. A sketch of the full sequence follows the output block below.

3. Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process will trigger a reconciliation.

```output
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
```
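
For steps 1 and 2, a minimal sketch that backs up the controlling workload, scales it down before the new upgrade attempt, and restores it afterward. The workload type, name, namespace, and replica count are placeholders:

```output
# Back up the controlling workload, then scale it to zero so its pods no longer block the drain.
$ kubectl get deployment.apps <name> -n <namespace> -o yaml > backup.yaml
$ kubectl scale --replicas=0 deployment.apps/<name> -n <namespace>
deployment.apps/<name> scaled

# After the upgrade attempt finishes, restore the original replica count (for example, 2).
$ kubectl scale --replicas=2 deployment.apps/<name> -n <namespace>
deployment.apps/<name> scaled
```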

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]