Pods get stuck in terminating state #9454

Closed
ruifung opened this issue Oct 5, 2024 · 2 comments

ruifung commented Oct 5, 2024

Bug Report

Pods randomly get stuck in the Terminating state on 1.8.0. It doesn't happen for every pod, but it happens often enough to build up a backlog of stuck pods.

Description

After updating my cluster to v1.8.0, I noticed that pods very often get stuck in Terminating status. Force deleting CAN work for some pods, but for pods with PVCs it leaves the volume attachments uncleaned, and the safest way to recover is then to restart the affected node(s).

I suspect the issue might be related to containerd/containerd#10727, since v1.8.0 uses containerd 2.0.0, or maybe to containerd/containerd#10755.

I'm not sure of the cause, but reverting to 1.7.7 definitely resolves it: as a small experiment I left one node running v1.8.0, and only that node continued to produce stuck pods.
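
For anyone who wants to check which nodes are producing stuck pods, here is a rough Python sketch. It assumes the input is the parsed output of `kubectl get pods -A -o json`, and it relies on my understanding that `metadata.deletionTimestamp` holds the time by which the pod should be gone (deletion request time plus grace period). The function name and the 5-minute `slack` are hypothetical choices for illustration, not something taken from Talos or Kubernetes.

```python
from datetime import datetime, timedelta, timezone


def overdue_terminating_pods(pod_list, now=None, slack=timedelta(minutes=5)):
    """Return (node, namespace, name) for pods still present past their deletion deadline.

    pod_list is the dict produced by parsing `kubectl get pods -A -o json`.
    slack is a hypothetical tolerance so pods that are merely slow are not reported.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        raw_deadline = meta.get("deletionTimestamp")
        if not raw_deadline:
            continue  # pod is not being deleted
        # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. 2024-10-05T13:36:58Z
        deadline = datetime.strptime(raw_deadline, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if now - deadline > slack:
            node = pod.get("spec", {}).get("nodeName")
            stuck.append((node, meta.get("namespace"), meta.get("name")))
    return stuck
```

To use it, load the JSON with `json.load(sys.stdin)` and pipe the `kubectl` output into the script. Grouping the result by the first tuple element shows whether the stuck pods cluster on particular nodes.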

Logs

Excerpt from CRI logs:

talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"066707c9890d866a4e2ca5452018082e735eeb09b08e42ed82c4b1331d0782e2\" pid:150265 exited_at:{seconds:1728135418 nanos:494801087}","time":"2024-10-05T13:36:58.495188602Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"db2adab2c1ef0cbb073ae975c9387373236cfc374d822041085fe1b707af74dd\" pid:150284 exited_at:{seconds:1728135418 nanos:495655288}","time":"2024-10-05T13:36:58.495934314Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"ce32d5078eceba0d526e0e7f08ed105c243c91a9d87efd9b7d3629cc03ab2412\" pid:150276 exited_at:{seconds:1728135418 nanos:495647473}","time":"2024-10-05T13:36:58.496048533Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = failed to stop sandbox \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\": failed to stop sandbox container \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" in \"SANDBOX_READY\" state: context deadline exceeded","level":"error","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\" failed","time":"2024-10-05T13:36:59.964361741Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"f9e9696c3c5396bc7a8a393cbec06a292db3fe04768e9932005605f860821473\"","time":"2024-10-05T13:37:00.395184636Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"8b22cf3f0ca6b98bcfba470db6caebbc34e71dc5114231e9c4e854b30a4a1cf2\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:00.395322440Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:00.812334019Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:01.036541035Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:01.040340873Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:01.062457311Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8","time":"2024-10-05T13:37:02.663179695Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed","time":"2024-10-05T13:37:02.681697009Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5","time":"2024-10-05T13:37:02.747565408Z"}
talos-worker-big-2.servers.internal: {"error":"cannot stat a stopped container/process: unknown","level":"error","msg":"collecting metrics for 3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35","time":"2024-10-05T13:37:03.080101532Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" to be killed: wait container \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\": context canceled","level":"error","msg":"StopContainer for \"28f04938da0a99c774207aa57acd2f456b8ad1c12dc12311b37bc1380e3457f5\" failed","time":"2024-10-05T13:37:03.971497412Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = an error occurs during waiting for container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" to be killed: wait container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\": context canceled","level":"error","msg":"StopContainer for \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\" failed","time":"2024-10-05T13:37:03.971525806Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" to be killed: wait container \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\": context deadline exceeded","level":"error","msg":"StopContainer for \"a96e7dc6ce5cae0af965d63a71238f78932e00520f8103fc24850db7a7b94bb8\" failed","time":"2024-10-05T13:37:04.971540932Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" to be killed: wait container \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\": context deadline exceeded","level":"error","msg":"StopContainer for \"3024c43549c33fa54cb9e128e853c92be71ef315ed6aae7cc2adfc4ebef5da35\" failed","time":"2024-10-05T13:37:04.971587842Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"82de1b7fdbdb08e91d6210188b9da8e8dde20101a1350f60f5e66f8c5c42e5ae\"","time":"2024-10-05T13:37:04.972100586Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"dbb1e042b85e88d74802dd3804a8b674be55cf64b487eb1f80756f066a8b86a9\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:04.972230305Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Kill container \"560d4f196466af6475e45494d37cf186c1d8746f12cc1f715812b923ca4388ed\"","time":"2024-10-05T13:37:04.972741216Z"}
talos-worker-big-2.servers.internal: {"error":"rpc error: code = Canceled desc = failed to stop sandbox \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\": failed to stop sandbox container \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" in \"SANDBOX_READY\" state: context canceled","level":"error","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\" failed","time":"2024-10-05T13:37:06.988683839Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"StopPodSandbox for \"95fbb6f35809189e9f6579b591a61a1bf68033203adab4bbbaa5c4f09645b6c6\"","time":"2024-10-05T13:37:07.422971089Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"Container to stop \"cdcf58781bd5b6ee964489813432650e7e1cc1d55e71d18f21462fcf96bc681c\" must be in running or unknown state, current state \"CONTAINER_EXITED\"","time":"2024-10-05T13:37:07.423119823Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"9d9f825fe87a8120d6945989840468d0fc266858ab64734c9741743a8ccda9d6\" id:\"7fbf1a93b093029c6830c2286242ca1e5c5731d88e4e9ec48a1430d386b6a9f6\" pid:150399 exited_at:{seconds:1728135428 nanos:443380713}","time":"2024-10-05T13:37:08.443902426Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"600e036a03c1a8a0665345fe62d51bfad21f08a7ca8c53f4bf28a30fe309c50d\" id:\"1c0c9ec0d47c307238eab20ba44225be040d9b9a2ea983c2a43784c72ac82579\" pid:150403 exited_at:{seconds:1728135428 nanos:450281678}","time":"2024-10-05T13:37:08.450609227Z"}
talos-worker-big-2.servers.internal: {"level":"info","msg":"TaskExit event in podsandbox handler container_id:\"4e121708fc549304dad3a5b85a40a8d81afca2543b0d48b66e0b1efced902c3e\" id:\"b0e57a2e14131b1da3dc0f0afb03341d2c6169d477065fa99b1a0c3fcae91e09\" pid:150417 exited_at:{seconds:1728135428 nanos:452051427}","time":"2024-10-05T13:37:08.452379236Z"}
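
To make an excerpt like this easier to scan, the failed stop calls can be tallied per node, operation, and gRPC status code. The following is a minimal Python sketch that assumes every line has the `<node>: {json}` shape shown above and that failures are logged at level `error` with a `msg` ending in `failed`. The function name is hypothetical, and the field names (`level`, `msg`, `error`) are simply the ones visible in the excerpt.

```python
import json
import re
from collections import Counter

# Matches the gRPC status code embedded in the CRI error string,
# e.g. "rpc error: code = DeadlineExceeded desc = ...".
GRPC_CODE = re.compile(r"rpc error: code = (\w+)")


def summarize_stop_failures(lines):
    """Count failed StopPodSandbox/StopContainer calls per (node, operation, gRPC code)."""
    counts = Counter()
    for line in lines:
        brace = line.find("{")
        if brace == -1:
            continue  # not a "<node>: {json}" line
        node = line[:brace].strip().rstrip(":")
        try:
            entry = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        msg = entry.get("msg", "")
        if entry.get("level") != "error" or not msg.endswith("failed"):
            continue
        operation = msg.split(" ", 1)[0]
        if operation not in ("StopPodSandbox", "StopContainer"):
            continue  # e.g. the "collecting metrics" errors are ignored
        match = GRPC_CODE.search(entry.get("error", ""))
        code = match.group(1) if match else "unknown"
        counts[(node, operation, code)] += 1
    return counts
```

Feeding it the log lines (for example from `open(path)`) returns a `Counter` keyed by node, operation, and status code, which separates `DeadlineExceeded` timeouts from `Canceled` calls.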

Environment

  • Talos version: v1.8.0
  • Kubernetes version: 1.31.1
  • Platform: nocloud (proxmox)

smira (Member) commented Oct 7, 2024

Thanks for reporting this issue. If it's containerd/containerd#10727, the fix for that is in v2.0.0-rc.5, which will be included in Talos 1.8.1.

ruifung (Author) commented Nov 12, 2024

Actually, I think it's siderolabs/extensions#417

ruifung closed this as completed Nov 12, 2024