
sandbox/cgroup: improve cgroup-based process termination algorithm #14513

Open
wants to merge 6 commits into master from improve-cgroup-based-process-termination
Conversation

@ZeyadYasser (Contributor) commented Sep 16, 2024

unify termination algorithm for v1/v2

  • for each snap cgroup:
    • while cgroup.procs is not empty:
      • SIGKILL each pid in cgroup.procs
  • for v1 only, also kill pids found in freezer cgroup created by snap-confine
    • this is relevant for systemd v237 (used in ubuntu 18.04) for non-root users where the transient scope cgroups are not created

This logic drops the freeze/kill/thaw approach, along with all of its awkward v1/v2/kernel backward-compatibility handling.

This PR should also fix the flaky results seen in tests/main/snap-remove-terminate.
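For reference, here is a minimal Go sketch of the drain loop described above, assuming pids are read from cgroup.procs and SIGKILLed until the file reports none left. The helper names (cgroupProcsPids, drainCgroup) are illustrative and this is not the PR's exact implementation:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// cgroupProcsPids returns the pids currently listed in a cgroup.procs file.
func cgroupProcsPids(path string) ([]int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, field := range strings.Fields(string(data)) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			return nil, err
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

// drainCgroup keeps sending SIGKILL until cgroup.procs is empty, which
// covers processes that fork between reads.
func drainCgroup(dir string) error {
	for {
		pids, err := cgroupProcsPids(filepath.Join(dir, "cgroup.procs"))
		if err != nil {
			return err
		}
		if len(pids) == 0 {
			// The cgroup is drained, nothing left to kill.
			return nil
		}
		for _, pid := range pids {
			// ESRCH means the process already exited, which is fine here.
			if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && !errors.Is(err, syscall.ESRCH) {
				return err
			}
		}
	}
}

func main() {
	if err := drainCgroup(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The outer loop only exits once a read of cgroup.procs returns no pids, which is what covers processes that fork between iterations; on v2 the actual implementation also tries cgroup.kill first, as shown in the excerpt further below.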

@ZeyadYasser added this to the 2.66 milestone Sep 16, 2024
@ZeyadYasser
Contributor Author

Something is going wrong in the new approach; I need to investigate why reading cgroup.procs reports "no such device" instead of "file does not exist".

+ snap remove --terminate test-snapd-sh
error: cannot perform the following tasks:
- Kill running snap "test-snapd-sh" apps (read /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/app.slice/snap.test-snapd-sh.sh-0e68ddf6-d14f-4c3c-9359-94bf510a9b8d.scope/cgroup.procs: no such device)

@ZeyadYasser
Contributor Author


Found root cause of ENODEV:

This is kernfs file-operation behavior: kernfs checks that the backing node is still active before continuing. It should be safe to skip ENODEV errors, similar to how fs.ErrNotExist is skipped.

ENODEV is triggered when we start opening/reading cgroup.procs and systemd removes the cgroup at the same time, while the kernel operation is still ongoing.
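A small hedged sketch of the resulting error handling (the helper name isCgroupGone is made up for illustration): both fs.ErrNotExist and syscall.ENODEV are treated as "the cgroup already went away", which counts as success as far as killing its processes is concerned.

```go
import (
	"errors"
	"io/fs"
	"syscall"
)

// isCgroupGone reports whether an error from reading or writing a cgroup
// file only means that the cgroup was removed concurrently (for example by
// systemd once its last process exited). kernfs returns ENODEV when the
// cgroup directory disappears while the open/read is still in flight, so
// it is treated the same way as fs.ErrNotExist.
func isCgroupGone(err error) bool {
	return errors.Is(err, fs.ErrNotExist) || errors.Is(err, syscall.ENODEV)
}
```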

//
// 1. Cgroup v2 freezer was only available since Linux 5.2 so freezing is a no-op before 5.2 which allows processes to keep forking.
// 2. Freezing does not put processes in an uninterruptible sleep unlike v1, so they can be killed externally and have their pid reused.
// 3. `cgroup.kill` was introduced in Linux 5.14 and solves the above issues as it kills the cgroup processes atomically.
func killSnapProcessesImplV2(ctx context.Context, snapName string) error {
killCgroupProcs := func(dir string) error {
// Use cgroup.kill if it exists (requires linux 5.14+)
err := writeExistingFile(filepath.Join(dir, "cgroup.kill"), []byte("1"))
if err == nil || !errors.Is(err, fs.ErrNotExist) {
Contributor

Given that ENODEV appears in certain scenarios, I wonder whether it could show up here as well. Looking at the kernel, could ENODEV be returned for every cgroup meta-file?

Contributor Author

Good point. I just checked and, weirdly enough, no:

https://elixir.bootlin.com/linux/v6.11/source/kernel/cgroup/cgroup.c#L4030, but I think we should add a check for good measure in case kernel behavior changes later.

It is surprising to see ENOENT/ENODEV thrown everywhere like that without distinction in the kernel source code.

// pids read earlier.
// Note: When cgroup v1 is detected, the call will also act on the freezer
// group created when a snap process was started to address a known bug on
// systemd v237 for non-root users.
Contributor

Maybe worth mentioning that this is only useful for killing apps or processes whose lifecycle is not managed by an external entity like systemd.
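For context, here is a rough sketch of the v1 fallback discussed in this thread. The freezer path layout is an assumption based on the discussion (snap-confine creates a per-snap freezer group on cgroup v1), and killV1FreezerPids is a made-up name rather than the PR's API:

```go
import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// killV1FreezerPids kills any pids found in the freezer cgroup that
// snap-confine creates for a snap on cgroup v1 hosts. On systemd v237
// (Ubuntu 18.04) transient scopes are not created for non-root users, so
// this freezer group may be the only place the snap's processes show up.
func killV1FreezerPids(snapName string) error {
	procs := filepath.Join("/sys/fs/cgroup/freezer", "snap."+snapName, "cgroup.procs")
	data, err := os.ReadFile(procs)
	if err != nil {
		if os.IsNotExist(err) {
			// The freezer group is gone or was never created, nothing to do.
			return nil
		}
		return err
	}
	for _, field := range strings.Fields(string(data)) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			return err
		}
		// Ignore the error: the process may already have exited.
		_ = syscall.Kill(pid, syscall.SIGKILL)
	}
	return nil
}
```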

// Keep sending SIGKILL signals until no more pids are left in cgroup
// to cover the case where a process forks before we kill it.
for {
// XXX: Should this have maximum retries?
Contributor

Perhaps we could have a spread test which essentially does a fork bomb inside the snap; with this code we should still arrive at a stable state once the cgroup is empty.

Contributor Author

Thanks, great idea. I added a fork-bomb test variant. I crashed my machine twice while making this 😅

@bboozzoo
Contributor

needs a rebase now

tests/lib/snaps/fork-bomb/meta/snap.yaml
license: GPL-3.0
apps:
  fork-bomb:
    command: bin/bomb
Contributor

You can add another app called sh with a trivial script which does exec, and then there would be no need to install test-snapd-sh (which also pulls in core, IIRC).

Contributor Author

Nice, the test is leaner now. Thanks!

unify termination algorithm for v1/v2
- for each snap cgroup:
  - while cgroup.procs is not empty:
    - SIGKILL each pid in cgroup.procs
- for v1 only, also kill pids found in freezer cgroup created by snap-confine
  - this is relevant for systemd v237 (used in ubuntu 18.04) for non-root users where the transient scope cgroups are not created

This logic drops the freeze/kill/thaw approach with all the weird v1/v2/kernel backward compatibility.

Signed-off-by: Zeyad Gouda <[email protected]>
This test variant stress-tests the new algorithm: without freezing first, snapd could be racing against a fork bomb, continuously killing pids as they show up until all pids are drained from the cgroup.

Signed-off-by: Zeyad Gouda <[email protected]>
@ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from 36bb459 to 55b74c3 on September 19, 2024 12:00
@bboozzoo (Contributor) left a comment

LGTM

sh_snap_bin="$(command -v fork-bomb.sh)"
if [ "$BAD_SNAP" = "fork-bomb" ]; then
#shellcheck disable=SC2016
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
Contributor

So that it doesn't blow up too much in the tests:

Suggested change
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
systemd-run --unit test-kill.service -p TasksMax=1000 flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'

Contributor Author

It seems to only apply to the outer systemd-run, but snapd's systemd-run escapes the outer cgroup limits imposed by TasksMax.

Interestingly, I think we accidentally just discovered that snaps don't behave normally when run as part of systemd directly, because snap run calls systemd-run, which creates a separate unit/cgroup.

@andrewphelpsj (Member) left a comment

Thanks! Some suggestions (and one open question), but looks good.

return err
}
return nil
}

var firstErr error
skipError := func(err error) bool {
if !errors.Is(err, fs.ErrNotExist) && firstErr == nil {
// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
Member

Suggested change
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
// processing the cgroup. ENODEV is returned by the kernel if the cgroup went

Contributor Author

Good catch, thank you!

Comment on lines 34 to 37
else
#shellcheck disable=SC2016
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; sleep 100000'
fi
Member

Does this other variant have much value now?

Contributor Author

The way I see it, the normal variant tests that the feature works in the common case, while the fork-bomb variant stresses the specific part of the code responsible for handling such processes. I like how the fork-bomb variant isolates where to look in case of a regression.

// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
// away while a kernfs operation is ongoing.
if !errors.Is(err, fs.ErrNotExist) && !errors.Is(err, syscall.ENODEV) && firstErr == nil {
Member

From the open(2) man page:

ENODEV pathname refers to a device special file and no corresponding device exists.  (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENXIO  The file is a device special file and no corresponding device exists.

I wonder if we should handle ENXIO too?

Contributor

I think it's ok as is. The error-handling branch is tailored to the error cases for which the overall operation of killing processes can be considered to have succeeded: specifically ENOENT, and then ENODEV to account for the cgroup v1 precedent. I do not see any evidence of ENXIO use under kernel/cgroup in the kernel tree, so I think it's ok for the code to fail explicitly should it ever occur.


codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 4 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (04d0ab0) to head (c9ef741).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
sandbox/cgroup/kill.go 90.47% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14513      +/-   ##
==========================================
+ Coverage   78.83%   78.85%   +0.02%     
==========================================
  Files        1078     1079       +1     
  Lines      145096   145467     +371     
==========================================
+ Hits       114389   114714     +325     
- Misses      23546    23583      +37     
- Partials     7161     7170       +9     
Flag Coverage Δ
unittests 78.85% <90.47%> (+0.02%) ⬆️

