sandbox/cgroup: improve cgroup-based process termination algorithm #14513
base: master
Conversation
Something is going wrong in the new approach, need to investigate why cgroup.procs reports
Force-pushed from c6a5813 to 371e0eb.
Found root cause of
It is kernfs file-operation behavior: kernfs checks that the file is still there before continuing, so it should be safe to skip these errors.
sandbox/cgroup/kill.go
Outdated
//
// 1. Cgroup v2 freezer was only available since Linux 5.2 so freezing is a no-op before 5.2 which allows processes to keep forking.
// 2. Freezing does not put processes in an uninterruptable sleep unlike v1, so they can be killed externally and have their pid reused.
// 3. `cgroup.kill` was introduced in Linux 5.14 and solves the above issues as it kills the cgroup processes atomically.
func killSnapProcessesImplV2(ctx context.Context, snapName string) error {
	killCgroupProcs := func(dir string) error {
		// Use cgroup.kill if it exists (requires linux 5.14+)
		err := writeExistingFile(filepath.Join(dir, "cgroup.kill"), []byte("1"))
		if err == nil || !errors.Is(err, fs.ErrNotExist) {
Given ENODEV appearing in certain scenarios, I wonder if it is possible for it to show up here as well. Looking at the kernel, could ENODEV be returned for every cgroup meta-file?
Good point. I just checked and, weirdly enough, no: https://elixir.bootlin.com/linux/v6.11/source/kernel/cgroup/cgroup.c#L4030. But I think we should add a check for good measure in case the kernel behavior changes later.
It is surprising to see ENOENT/ENODEV thrown everywhere like that without distinction in the kernel source code.
sandbox/cgroup/kill.go
Outdated
// pids read earlier.
// Note: When cgroup v1 is detected, the call will also act on the freezer
// group created when a snap process was started to address a known bug on
// systemd v237 for non-root users.
maybe worth mentioning that this is only useful for killing apps or processes which do not have their lifecycle managed by external entities like systemd
// Keep sending SIGKILL signals until no more pids are left in cgroup
// to cover the case where a process forks before we kill it.
for {
	// XXX: Should this have maximum retries?
Perhaps we could have a spread test which essentially does a fork bomb inside the snap; with this code we should still arrive at a stable state when the cgroup is empty.
Thanks, great idea. I added a fork-bomb test variant. I crashed my machine twice making this 😅
needs a rebase now
license: GPL-3.0
apps:
  fork-bomb:
    command: bin/bomb
You can add another app called sh with a trivial script which does exec, and then there'd be no need to install test-snapd-sh (which also pulls in core IIRC).
Nice, the test is leaner now. Thanks!
unify termination algorithm for v1/v2

- for each snap cgroup:
  - while cgroup.procs is not empty:
    - SIGKILL each pid in cgroup.procs
- for v1 only, also kill pids found in the freezer cgroup created by snap-confine
  - this is relevant for systemd v237 (used in Ubuntu 18.04) for non-root users, where the transient scope cgroups are not created

This logic drops the freeze/kill/thaw approach with all the weird v1/v2/kernel backward compatibility.

Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
This test variant stress-tests the new algorithm, where snapd could be racing against a fork bomb without freezing first, by continuously killing pids that show up until all pids are drained from the cgroup.

Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
Force-pushed from 36bb459 to 55b74c3.
LGTM
sh_snap_bin="$(command -v fork-bomb.sh)"
if [ "$BAD_SNAP" = "fork-bomb" ]; then
    #shellcheck disable=SC2016
    systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
so that it doesn't blow up too much in the tests:

- systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
+ systemd-run --unit test-kill.service -p TasksMax=1000 flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
It seems to only apply to the outer systemd-run, but snapd's systemd-run escapes the outer cgroup limits imposed by TasksMax.
Interestingly, I think we accidentally just discovered that snaps don't behave normally when run as part of systemd directly, because snap run calls systemd-run, which creates a separate unit/cgroup.
Thanks! Some suggestions (and one open question), but looks good.
sandbox/cgroup/kill.go
Outdated
		return err
	}
	return nil
}

var firstErr error
skipError := func(err error) bool {
	if !errors.Is(err, fs.ErrNotExist) && firstErr == nil {
		// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
		// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
- // processing the cgorup. ENODEV is returned by the kernel if the cgroup went
+ // processing the cgroup. ENODEV is returned by the kernel if the cgroup went
good catch, thank you!
else
    #shellcheck disable=SC2016
    systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; sleep 100000'
fi
Does this other variant have much value now?
The way I see it, the normal variant tests that the feature works in the normal case and the other fork-bomb variant stresses a specific part of the code responsible for handling such processes. I like how the fork-bomb variant isolates where to look in case of a regression.
// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
// away while a kernfs operation is ongoing.
if !errors.Is(err, fs.ErrNotExist) && !errors.Is(err, syscall.ENODEV) && firstErr == nil {
From the open man page:
ENODEV pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENXIO The file is a device special file and no corresponding device exists.
I wonder if we should handle ENXIO too?
I think it's ok as is. The error handling branch is tailored to the error cases for which the overall operation of killing processes can be considered to have succeeded: specifically ENOENT, and then ENODEV to account for the cgroup v1 precedent. I do not see any evidence of ENXIO use under kernel/cgroup in the kernel tree, so I think it's ok for the code to fail explicitly should it ever occur.
Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14513 +/- ##
==========================================
+ Coverage 78.83% 78.85% +0.02%
==========================================
Files 1078 1079 +1
Lines 145096 145467 +371
==========================================
+ Hits 114389 114714 +325
- Misses 23546 23583 +37
- Partials 7161 7170 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
unify termination algorithm for v1/v2
This logic drops the freeze/kill/thaw approach with all the weird v1/v2/kernel backward compatibility.
This PR should also fix flaky results seen in tests/main/snap-remove-terminate.