Bug Report
Pods randomly get stuck in the Terminating state on 1.8.0. It doesn't happen for every pod, but it happens often enough to build up a backlog of stuck pods.
Description
After updating my cluster to v1.8.0, I noticed that pods very often get stuck in Terminating status. Force deleting can work for some pods, but for pods with PVCs the volume attachments never get cleaned up, and then the safest way to recover is to restart the node(s).
I suspect the issue might be related to containerd/containerd#10727, since v1.8.0 uses containerd 2.0.0. Or maybe containerd/containerd#10755.
I'm not sure of the root cause, but reverting to v1.7.7 definitely resolves it: I ran a small experiment where I left one node running v1.8.0, and only that node continued to produce stuck pods.
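For reference, this is roughly how I've been checking for and clearing the stuck pods (a minimal sketch; <pod-name> and <namespace> are placeholders, and the force delete only helps for pods without PVCs):

  # List pods stuck in Terminating across all namespaces
  kubectl get pods -A | grep Terminating

  # Force delete a stuck pod; this can work for pods without PVCs
  kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

  # For pods with PVCs, check whether the corresponding VolumeAttachment was ever cleaned up
  kubectl get volumeattachments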
Logs
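The excerpt below comes from the CRI containerd service on the affected worker and was collected along these lines (a sketch, assuming the stock talosctl client can reach the node):

  # Follow the CRI containerd service logs on the affected worker
  talosctl -n talos-worker-big-2.servers.internal logs cri -f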
Excerpt from CRI logs:
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"066707c9890d866a4e2ca5452018082e735eeb09b08e42ed82c4b1331d0782e2\" pid:150265 exited_at:{seconds:1728135418 nanos:494801087}","time":"2024-10-05T13:36:58.495188602Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"db2adab2c1ef0cbb073ae975c9387373236cfc374d822041085fe1b707af74dd\" pid:150284 exited_at:{seconds:1728135418 nanos:495655288}","time":"2024-10-05T13:36:58.495934314Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"ce32d5078eceba0d526e0e7f08ed105c243c91a9d87efd9b7d3629cc03ab2412\" pid:150276 exited_at:{seconds:1728135418 nanos:495647473}","time":"2024-10-05T13:36:58.496048533Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = failed to stop sandbox \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\": failed to stop sandbox container \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" in \"SANDBOX_READY\" state: context deadline exceeded","level":"error","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" failed","time":"2024-10-05T13:36:59.964361741Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\"","time":"2024-10-05T13:37:00.395184636Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"8b22cf3f0ca6b98bcfba470db6caebbc34e71dc5114231e9c4e854b30a4a1cf2\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:00.395322440Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:00.812334019Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:01.036541035Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:01.040340873Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:01.062457311Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:02.663179695Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:02.681697009Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:02.747565408Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:03.080101532Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" to be killed: wait container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\": context canceled","level":"error","msg":"StopContainer for \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" failed","time":"2024-10-05T13:37:03.971497412Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" to be killed: wait container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\": context canceled","level":"error","msg":"StopContainer for \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" failed","time":"2024-10-05T13:37:03.971525806Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" to be killed: wait container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\": context deadline exceeded","level":"error","msg":"StopContainer for \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" failed","time":"2024-10-05T13:37:04.971540932Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" to be killed: wait container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\": context deadline exceeded","level":"error","msg":"StopContainer for \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" failed","time":"2024-10-05T13:37:04.971587842Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"82de1b7fdbdb08e91d6210188b9da8e8dde20101a1350f60f5e66f8c5c42e5ae\"","time":"2024-10-05T13:37:04.972100586Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"dbb1e042b85e88d74802dd3804a8b674be55cf64b487eb1f80756f066a8b86a9\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:04.972230305Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Kill container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\"","time":"2024-10-05T13:37:04.972741216Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = failed to stop sandbox \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\": failed to stop sandbox container \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" in \"SANDBOX_READY\" state: context canceled","level":"error","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" failed","time":"2024-10-05T13:37:06.988683839Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\"","time":"2024-10-05T13:37:07.422971089Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"cdcf58781bd5b6ee964489813432650e7e1cc1d55e71d18f21462fcf96bc681c\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:07.423119823Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"7fbf1a93b093029c6830c2286242ca1e5c5731d88e4e9ec48a1430d386b6a9f6\" pid:150399 exited_at:{seconds:1728135428 nanos:443380713}","time":"2024-10-05T13:37:08.443902426Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"1c0c9ec0d47c307238eab20ba44225be040d9b9a2ea983c2a43784c72ac82579\" pid:150403 exited_at:{seconds:1728135428 nanos:450281678}","time":"2024-10-05T13:37:08.450609227Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"b0e57a2e14131b1da3dc0f0afb03341d2c6169d477065fa99b1a0c3fcae91e09\" pid:150417 exited_at:{seconds:1728135428 nanos:452051427}","time":"2024-10-05T13:37:08.452379236Z"}
Environment
Talos version: v1.8.0
Kubernetes version: 1.31.1
Platform: nocloud (proxmox)