Summary
On November 9 we experienced an outage on Charmed MicroK8s 1.28/stable when the snap automatically refreshed from 1.28.14 to 1.28.15. Several of the kube-system pods restarted and failed to come back up, stuck in either Pending or CrashLoopBackOff. These were dispersed across all 3 nodes in the cluster, so the problem was not confined to a single node. We also noticed that all the Calico pods were deleted and recreated (not just restarted). They also failed to come up, which may be what caused the issues with the other pods.
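For reference, this is roughly how we confirmed the failures spanned all three nodes (standard kubectl; the pod name below is a placeholder):
# List kube-system pods with their node placement
microk8s kubectl get pods -n kube-system -o wide
# Inspect recent events for one of the stuck pods (name is a placeholder)
microk8s kubectl describe pod -n kube-system calico-node-xxxxx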
What Should Happen Instead?
Patch updates should not cause service disruptions.
Reproduction Steps
Cannot reproduce, since this was an automatic snap update.
Introspection Report
We did not collect one before resolving the incident, but this is the journal log from around the time of the refresh (17:07): https://pastebin.canonical.com/p/rV43qDMPw2/
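In hindsight, two things would have helped here; a sketch, assuming snapd 2.58+ for the hold option:
# Collect the MicroK8s introspection tarball while the incident is live
# (the command prints the path of the generated report when it finishes)
microk8s inspect
# Hold automatic snap refreshes so patch updates land in a maintenance window
sudo snap refresh --hold microk8s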
After the refresh, I see many "task not found" errors like these:
Nov 09 17:36:47 microk8s-1 microk8s.daemon-kubelite[1826546]: E1109 23:36:47.015909 1826546 manager.go:1106] Failed to create existing container: /kubepods/besteffort/pod376b187f-f1cb-426d-b39d-130687311b1d/a7d239e3b3b5933c4e5cee3a21da35ea653acbbc8c4e604dd9852309cd89e508: task a7d239e3b3b5933c4e5cee3a21da35ea653acbbc8c4e604dd9852309cd89e508 not found: not found
Nov 09 17:36:50 microk8s-1 microk8s.daemon-kubelite[1826546]: E1109 23:36:50.128747 1826546 manager.go:1106] Failed to create existing container: /kubepods/besteffort/pod376b187f-f1cb-426d-b39d-130687311b1d/84ab7836363e64c15484a9eda4e0aad4fdcaff7b4b08d42a1fe161e43631a5a3: task 84ab7836363e64c15484a9eda4e0aad4fdcaff7b4b08d42a1fe161e43631a5a3 not found: not found
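Our reading, not confirmed, is that these errors come from the kubelet's cAdvisor still tracking container IDs whose containerd tasks did not survive the refresh. Restarting the node services might have been a lighter-weight recovery than a full reboot; a sketch:
# Restart all MicroK8s services on the affected node
microk8s stop
microk8s start
# Or restart only the combined kubelet/apiserver daemon
sudo snap restart microk8s.daemon-kubelite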
Can you suggest a fix?
Rebooting the nodes one by one resolved the issue.
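For the record, the rolling reboot looked roughly like this on each node (drain flags are the usual ones; microk8s-1 is the node name from the logs above):
# Drain the node, reboot it, then let it take pods again
microk8s kubectl drain microk8s-1 --ignore-daemonsets --delete-emptydir-data
sudo reboot
# once the node is back up:
microk8s kubectl uncordon microk8s-1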
Are you interested in contributing with a fix?
@ktsakalozos