Draining node is called multiple times (30+) while deleting a machine #130

nasusoba opened this issue Jun 7, 2024 · 2 comments
nasusoba commented Jun 7, 2024

I created a k3s cluster with the CAPZ ClusterClass. While deleting a machine, even though the node drain succeeds, CAPI keeps draining the same node 30+ times. More investigation is needed to determine whether this is specific to CAPI k3s.

Log:

# 1st time
I0606 09:56:31.424772       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:56:33.831664       1 machine_controller.go:669] "WARNING: ignoring DaemonSet-managed Pods: kube-system/cloud-node-manager-877r7, kube-system/etcd-proxy-dq28c, kube-system/svclb-traefik-9d54adee-jmxgh\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:56:34.146008       1 machine_controller.go:927] "evicting pod kube-system/traefik-f4564c4f4-fhqsl\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
...
E0606 09:56:54.324252       1 machine_controller.go:686] "Drain failed, retry in 20s" err="error when waiting for pod \"local-path-provisioner-84db5d44d9-lhvh6\" in namespace \"kube-system\" to terminate: global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="e220ef5b-d4d8-454d-a787-cc29777a4d31" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
....
# 2nd time (repeated 30+ times within 2 minutes)
I0606 09:57:05.236640       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:57:08.328683       1 machine_controller.go:669] "WARNING: ignoring DaemonSet-managed Pods: kube-system/cloud-node-manager-877r7, kube-system/etcd-proxy-dq28c, kube-system/svclb-traefik-9d54adee-jmxgh\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:57:08.328722       1 machine_controller.go:690] "Drain successful" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="a706ce50-e0f7-443c-bcd7-33a3168a85ab" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
...
# 37th time -> machine got deleted
I0606 09:59:42.458023       1 machine_controller.go:379] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:42.765445       1 machine_controller.go:643] "Could not find node from noderef, it may have already been deleted" err="nodes \"demok3s-tbbs6-hmxp4\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:42.765538       1 machine_controller.go:710] "Could not find node from noderef, it may have already been deleted" err="Node \"demok3s-tbbs6-hmxp4\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
I0606 09:59:42.766135       1 machine_controller.go:468] "Deleting node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654" KThreesControlPlane="default/demok3s-jtx7s" Cluster="default/demok3s" Node="demok3s-tbbs6-hmxp4"
E0606 09:59:43.098869       1 controller.go:329] "Reconciler error" err="failed to patch Machine default/demok3s-jtx7s-tpjl2: machines.cluster.x-k8s.io \"demok3s-jtx7s-tpjl2\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/demok3s-jtx7s-tpjl2" namespace="default" name="demok3s-jtx7s-tpjl2" reconcileID="af7a743d-dec6-402f-a16d-843dc438d654"

This could be worked around by setting nodeDrainTimeout to something short like 30s, but it is still a potential issue if we keep sending lots of unnecessary requests to the API server.
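
For reference, a minimal sketch of that workaround, assuming a CAPI version that exposes nodeDrainTimeout on the Cluster topology (the cluster, namespace, and machine names below are taken from the logs above):

# Bound node drain to 30s for control-plane machines of the ClusterClass-based cluster.
kubectl patch cluster demok3s -n default --type merge \
  -p '{"spec":{"topology":{"controlPlane":{"nodeDrainTimeout":"30s"}}}}'

# Alternatively, patch a single Machine (spec.nodeDrainTimeout is what the machine
# controller reads while draining); the owning controller may reconcile this away.
kubectl patch machine demok3s-jtx7s-tpjl2 -n default --type merge \
  -p '{"spec":{"nodeDrainTimeout":"30s"}}'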

mogliang (Collaborator) commented

It seems that after removing etcd from the machine to be deleted, that machine still receives API server requests but fails to respond.

Let's confirm whether that is the case, and why the load balancer doesn't remove the unresponsive API server.

nasusoba commented Jun 13, 2024

More details about the potential bug we found:

Bug

I created a local management cluster and a k3s workload cluster on Azure with 3 control planes, then triggered an upgrade of the k3s version. The upgrade took a very long time: from the logs, the k3s control plane controller kept complaining that it could not reach the API server. The problem persisted for about an hour before the controller was finally able to reach the API server again. Interestingly, the API server was reachable via kubeconfig during this period.

E0611 08:05:43.195801       1 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"default/demok3s\": error creating client for remote cluster \"default/demok3s\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://demok3s-cd37da3.eastus.cloudapp.azure.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="clusterresourceset" controllerGroup="addons.cluster.x-k8s.io" controllerKind="ClusterResourceSet" ClusterResourceSet="default/crs-azure-cn" namespace="default" name="crs-azure-cn" reconcileID="4652e61f-c74d-4182-a35d-47001dd5db07"
(repeating.....)

Also, draining the node could fail due to a timeout; the log from CAPI core:

I0611 08:25:21.039890       1 recorder.go:104] "error draining Machine's node \"demok3s-zbwfr-9vkxq\": unable to get node demok3s-zbwfr-9vkxq: Get \"https://demok3s-cd37da3.eastus.cloudapp.azure.com:6443/api/v1/nodes/demok3s-zbwfr-9vkxq?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" logger="events" type="Warning" object={"kind":"Machine","namespace":"default","name":"demok3s-9xzb4-d5vxx","uid":"de5598fb-ac2c-4f2f-9bf6-ae8f4df1923b","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"40683"} reason="FailedDrainNode"
(repeating.....)

After moving the management cluster to Azure, the problem seems to be fixed, but we still observe the timeout log occasionally, lasting for ~1 minute.
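
One way to confirm this behaviour, sketched here with the cluster name and endpoint from the logs above, is to compare a direct probe of the endpoint the controllers use with a query through the cluster's kubeconfig:

# Probe the control-plane endpoint behind the Azure load balancer.
# Even a 401/403 response proves the endpoint itself is reachable.
curl -vk --max-time 10 https://demok3s-cd37da3.eastus.cloudapp.azure.com:6443/version

# Fetch the workload cluster kubeconfig and query the API server through it.
clusterctl get kubeconfig demok3s -n default > /tmp/demok3s.kubeconfig
kubectl --kubeconfig /tmp/demok3s.kubeconfig get nodes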

Potential cause

The image below shows the Azure load balancer for the control plane nodes. During the upgrade, a machine is being deleted (in this image, machine z95c7; its etcd member has been removed and its API server is not reachable), yet it is still in the Azure load balancer backend pool. We suspect the k3s control plane controller is still sending requests to the machine being deleted, which causes the timeouts.
(Screenshot: Azure load balancer backend pool for the control plane, with machine z95c7 still listed.)
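
To double-check the hypothesis, the backend pool and the health probes of the control-plane load balancer can be inspected; a sketch with placeholder resource names:

# List backend pool members; the machine being deleted should disappear from here
# once its API server is gone (resource group / LB name are placeholders).
az network lb address-pool list --resource-group <resource-group> --lb-name <lb-name> -o table

# Inspect the health probes that decide whether the LB keeps routing to an instance
# (typically a probe on port 6443 for the API server).
az network lb probe list --resource-group <resource-group> --lb-name <lb-name> -o table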
