Gvisor pod cannot be terminated properly #417

Open
frezbo opened this issue Jun 18, 2024 · 20 comments

@frezbo
Member

frezbo commented Jun 18, 2024

The gVisor test pod used in the Talos e2e-extensions test never terminates successfully. This causes the reboot/shutdown sequence to hang and eventually time out, and the kubelet shows a "failed to delete pod sandbox" error. The gVisor test is going to be disabled until this is addressed.

@frezbo
Member Author

frezbo commented Jun 18, 2024

Ref: siderolabs/talos#8905

@frezbo
Member Author

frezbo commented Jun 24, 2024

Upstream issue: google/gvisor#9834 (comment)

@SISheogorath
Contributor

Just hit this after upgrading to Talos 1.8.0

@BobyMCbobs
Contributor

Also have been experiencing this

@frezbo
Member Author

frezbo commented Oct 16, 2024

Gvisor is still broken with talos main

 Warning  FailedKillPod           17s    kubelet            error killing pod: failed to "KillPodSandbox" for "01ee1caf-9da0-40af-a663-5408d37d8a0e" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

@smira
Member

smira commented Oct 16, 2024

Can you try with gvisor-debug and debug containerd logs so that we can capture more?

@frezbo
Member Author

frezbo commented Oct 16, 2024

It seems that when gvisor-debug is added, containerd still uses the runsc.toml from gvisor instead of the one from gvisor-debug:

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/gvisor-debug.part
[debug]
  level = "debug"
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true

❯ talosctl -n 10.5.0.3 read /etc/cri/conf.d/runsc.toml 
[runsc_config]

❯ talosctl -n 10.5.0.3 get extensions
WARNING: 10.5.0.3: server version 1.8.0-alpha.2-70-ga9bff3a1d-dirty is older than client version 1.8.1
NODE       NAMESPACE   TYPE              ID   VERSION   NAME           VERSION
10.5.0.3   runtime     ExtensionStatus   0    1         gvisor-debug   v1.0.0
10.5.0.3   runtime     ExtensionStatus   1    1         gvisor         20240826.0

@frezbo
Member Author

frezbo commented Oct 16, 2024

not sure why

@smira
Member

smira commented Oct 16, 2024

I wonder if that's the order of extensions?

@SISheogorath
Contributor

I think we should integrate gvisor debug with the general gvisor extension and just add them as additional runtimes.

They would stay unused unless someone configures a RuntimeClass for debugging, and this would help reduce the overhead we see here right now.
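
A rough sketch of what registering the debug variant as an additional runtime could look like, assuming the version 2 containerd config layout and the `io.containerd.runsc.v1` shim type documented by gVisor. The runtime name `runsc-debug` and the `ConfigPath` value are hypothetical, and the CRI plugin section name differs in newer containerd config versions, so it needs checking against the containerd actually in use:

```toml
# Sketch only, not the extension's actual config.
# "runsc-debug" and the ConfigPath below are hypothetical names.
version = 2

# A second gVisor runtime registered next to the regular "runsc" one.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc-debug]
  runtime_type = "io.containerd.runsc.v1"

# Point the shim at a separate config file that enables debug logging.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc-debug.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/cri/conf.d/runsc-debug.toml"
```

Pods would only pick this up through a RuntimeClass whose handler is `runsc-debug`, so the regular `runsc` runtime stays untouched.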

@frezbo
Member Author

frezbo commented Oct 16, 2024

attaching support zip and runsc logs
support.zip
runsc.tar.gz

@smira
Member

smira commented Oct 17, 2024

I don't see any errors in the logs you posted so far.

@frezbo
Member Author

frezbo commented Oct 17, 2024

yeh, that's the thing, it's just the pod fails to terminate

@SISheogorath
Contributor

SISheogorath commented Oct 17, 2024

I'm quite sure it's a containerd vs gvisor-shim problem.

Given how many breaking changes containerd v2 introduced in that space:

https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking

It's probably broken from here:
https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27

As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed

@frezbo
Member Author

frezbo commented Oct 18, 2024

> I'm quite sure it's a containerd vs gvisor-shim problem.
>
> Given how many breaking changes containerd v2 introduced in that space:
>
> https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#whats-breaking
>
> It's probably broken from here: https://github.com/google/gvisor/blob/abe38d82ac3634264608259d1c60003cdd53658a/shim/cli/cli.go#L27
>
> As it's called out in containerd v2 as removed here: https://github.com/containerd/containerd/blob/main/docs/containerd-2.0.md#iocontainerdruntimev1linux-and-iocontainerdruncv1-have-been-removed

would you like to create an upstream issue then?

@smira
Member

smira commented Oct 18, 2024

I think containerd removed its own runc.v1 shim, which is totally unrelated to gvisor, but of course there might still be some issue.

@frezbo
Member Author

frezbo commented Oct 24, 2024

containerd issue: containerd/containerd#10891

@frezbo
Member Author

frezbo commented Dec 18, 2024

New gvisor issue here: google/gvisor#11308

@ayushr2

ayushr2 commented Dec 18, 2024

> I'm quite sure it's a containerd vs gvisor-shim problem.
> Given how many breaking changes containerd v2 introduced in that space:

@SISheogorath @smira I saw that containerd v2.0.1 was released just 5 days back: https://github.com/containerd/containerd/releases/tag/v2.0.1.

Have you been on containerd v2 from before that? From your investigation in google/gvisor#11308 (comment), your intuition does feel correct. Something at the shim level is misbehaving (i.e. the shim is not being invoked the way it expects to be).

@SISheogorath
Contributor

@ayushr2 Yes. Talos Linux has used containerd v2 since v1.8.0, initially as release candidates; with Talos v1.8.3, it moved to the stable containerd v2 release.
