Bug Report
Pods randomly get stuck in the Terminating state on 1.8.0. It doesn't happen for every pod, but it happens often enough to build up a backlog of stuck pods.
Description
After updating my cluster to v1.8.0, I noticed that pods very often get stuck in Terminating status. Force deleting can work for some pods, but for pods with PVCs the volume attachments never get cleaned up, and then the safest way to recover is to restart the node(s).
I suspect the issue might be related to containerd/containerd#10727, since v1.8.0 uses containerd 2.0.0. Or maybe containerd/containerd#10755.
I'm not sure of the root cause, but reverting to v1.7.7 definitely resolves it: I ran a small experiment where I left one node running v1.8.0, and only that node continued to produce stuck pods.
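For reference, this is roughly how I've been checking for and clearing the stuck pods (a minimal sketch; <pod-name> and <namespace> are placeholders, and the force delete only helps for pods without PVCs):

  # List pods stuck in Terminating across all namespaces
  kubectl get pods -A | grep Terminating

  # Force delete a stuck pod; this can work for pods without PVCs
  kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

  # For pods with PVCs, check whether the corresponding VolumeAttachment was ever cleaned up
  kubectl get volumeattachments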
Logs
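The excerpt below comes from the CRI containerd service on the affected worker and was collected along these lines (a sketch, assuming the stock talosctl client can reach the node):

  # Follow the CRI containerd service logs on the affected worker
  talosctl -n talos-worker-big-2.servers.internal logs cri -f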
Excerpt from CRI logs:
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"066707c9890d866a4e2ca5452018082e735eeb09b08e42ed82c4b1331d0782e2\" pid:150265 exited_at:{seconds:1728135418 nanos:494801087}","time":"2024-10-05T13:36:58.495188602Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"db2adab2c1ef0cbb073ae975c9387373236cfc374d822041085fe1b707af74dd\" pid:150284 exited_at:{seconds:1728135418 nanos:495655288}","time":"2024-10-05T13:36:58.495934314Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"ce32d5078eceba0d526e0e7f08ed105c243c91a9d87efd9b7d3629cc03ab2412\" pid:150276 exited_at:{seconds:1728135418 nanos:495647473}","time":"2024-10-05T13:36:58.496048533Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = failed to stop sandbox \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\": failed to stop sandbox container \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" in \"SANDBOX_READY\" state: context deadline exceeded","level":"error","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" failed","time":"2024-10-05T13:36:59.964361741Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\"","time":"2024-10-05T13:37:00.395184636Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"8b22cf3f0ca6b98bcfba470db6caebbc34e71dc5114231e9c4e854b30a4a1cf2\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:00.395322440Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:00.812334019Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:01.036541035Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:01.040340873Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:01.062457311Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:02.663179695Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:02.681697009Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:02.747565408Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:03.080101532Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" to be killed: wait container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\": context canceled","level":"error","msg":"StopContainer for \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" failed","time":"2024-10-05T13:37:03.971497412Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" to be killed: wait container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\": context canceled","level":"error","msg":"StopContainer for \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" failed","time":"2024-10-05T13:37:03.971525806Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" to be killed: wait container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\": context deadline exceeded","level":"error","msg":"StopContainer for \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" failed","time":"2024-10-05T13:37:04.971540932Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" to be killed: wait container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\": context deadline exceeded","level":"error","msg":"StopContainer for \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" failed","time":"2024-10-05T13:37:04.971587842Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"82de1b7fdbdb08e91d6210188b9da8e8dde20101a1350f60f5e66f8c5c42e5ae\"","time":"2024-10-05T13:37:04.972100586Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"dbb1e042b85e88d74802dd3804a8b674be55cf64b487eb1f80756f066a8b86a9\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:04.972230305Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Kill container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\"","time":"2024-10-05T13:37:04.972741216Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = failed to stop sandbox \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\": failed to stop sandbox container \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" in \"SANDBOX_READY\" state: context canceled","level":"error","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" failed","time":"2024-10-05T13:37:06.988683839Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\"","time":"2024-10-05T13:37:07.422971089Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"cdcf58781bd5b6ee964489813432650e7e1cc1d55e71d18f21462fcf96bc681c\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:07.423119823Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"7fbf1a93b093029c6830c2286242ca1e5c5731d88e4e9ec48a1430d386b6a9f6\" pid:150399 exited_at:{seconds:1728135428 nanos:443380713}","time":"2024-10-05T13:37:08.443902426Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"1c0c9ec0d47c307238eab20ba44225be040d9b9a2ea983c2a43784c72ac82579\" pid:150403 exited_at:{seconds:1728135428 nanos:450281678}","time":"2024-10-05T13:37:08.450609227Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"b0e57a2e14131b1da3dc0f0afb03341d2c6169d477065fa99b1a0c3fcae91e09\" pid:150417 exited_at:{seconds:1728135428 nanos:452051427}","time":"2024-10-05T13:37:08.452379236Z"}
Environment
Talos version: v1.8.0
Kubernetes version: 1.31.1
Platform: nocloud (proxmox)