Summary
On November 9 we experienced an outage on Charmed MicroK8s 1.28/stable when the snap automatically refreshed from 1.28.14 to 1.28.15. Several of the kube-system pods restarted and failed to come back up, stuck in either Pending or CrashLoopBackOff. These were dispersed across all 3 nodes in the cluster, so the problem was not confined to a single node. We also noticed that all the Calico pods were deleted and recreated (not just restarted). They also failed to come up, which may be what caused the issues with the other pods.
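For reference, this is roughly how we confirmed the failures spanned all three nodes (standard kubectl; the pod name below is a placeholder):
# List kube-system pods with their node placement
microk8s kubectl get pods -n kube-system -o wide
# Inspect recent events for one of the stuck pods (name is a placeholder)
microk8s kubectl describe pod -n kube-system calico-node-xxxxx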
What Should Happen Instead?
Patch updates should not cause service disruptions.
Reproduction Steps
Cannot reproduce, since this was an automatic snap update.
Introspection Report
We did not collect one before resolving the incident, but this is the journal log from around the time of the refresh (17:07): https://pastebin.canonical.com/p/rV43qDMPw2/
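In hindsight, two things would have helped here; a sketch, assuming snapd 2.58+ for the hold option:
# Collect the MicroK8s introspection tarball while the incident is live
# (the command prints the path of the generated report when it finishes)
microk8s inspect
# Hold automatic snap refreshes so patch updates land in a maintenance window
sudo snap refresh --hold microk8s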
After the refresh, I see many "task not found" errors like these:
Nov 09 17:36:47 microk8s-1 microk8s.daemon-kubelite[1826546]: E1109 23:36:47.015909 1826546 manager.go:1106] Failed to create existing container: /kubepods/besteffort/pod376b187f-f1cb-426d-b39d-130687311b1d/a7d239e3b3b5933c4e5cee3a21da35ea653acbbc8c4e604dd9852309cd89e508: task a7d239e3b3b5933c4e5cee3a21da35ea653acbbc8c4e604dd9852309cd89e508 not found: not found
Nov 09 17:36:50 microk8s-1 microk8s.daemon-kubelite[1826546]: E1109 23:36:50.128747 1826546 manager.go:1106] Failed to create existing container: /kubepods/besteffort/pod376b187f-f1cb-426d-b39d-130687311b1d/84ab7836363e64c15484a9eda4e0aad4fdcaff7b4b08d42a1fe161e43631a5a3: task 84ab7836363e64c15484a9eda4e0aad4fdcaff7b4b08d42a1fe161e43631a5a3 not found: not found
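Our reading, not confirmed, is that these errors come from the kubelet's cAdvisor still tracking container IDs whose containerd tasks did not survive the refresh. Restarting the node services might have been a lighter-weight recovery than a full reboot; a sketch:
# Restart all MicroK8s services on the affected node
microk8s stop
microk8s start
# Or restart only the combined kubelet/apiserver daemon
sudo snap restart microk8s.daemon-kubelite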
Can you suggest a fix?
Rebooting the nodes one by one resolved the issue.
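For the record, the rolling reboot looked roughly like this on each node (drain flags are the usual ones; microk8s-1 is the node name from the logs above):
# Drain the node, reboot it, then let it take pods again
microk8s kubectl drain microk8s-1 --ignore-daemonsets --delete-emptydir-data
sudo reboot
# once the node is back up:
microk8s kubectl uncordon microk8s-1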
Are you interested in contributing with a fix?
@ktsakalozos