-
Notifications
You must be signed in to change notification settings - Fork 582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sandbox/cgroup: improve cgroup-based process termination algorithm #14513
sandbox/cgroup: improve cgroup-based process termination algorithm #14513
Conversation
Something is going wrong in the new approach, need to investigate why cgroup.procs reports
|
c6a5813
to
371e0eb
Compare
Found root cause of
It is kernfs file operations behavior where it checks that the file is still there before continuing. It should be safe to skip errors for What triggers |
sandbox/cgroup/kill.go
Outdated
// | ||
// 1. Cgroup v2 freezer was only available since Linux 5.2 so freezing is a no-op before 5.2 which allows processes to keep forking. | ||
// 2. Freezing does not put processes in an uninterruptable sleep unlike v1, so they can be killed externally and have their pid reused. | ||
// 3. `cgroup.kill` was introduced in Linux 5.14 and solves the above issues as it kills the cgroup processes atomically. | ||
func killSnapProcessesImplV2(ctx context.Context, snapName string) error { | ||
killCgroupProcs := func(dir string) error { | ||
// Use cgroup.kill if it exists (requires linux 5.14+) | ||
err := writeExistingFile(filepath.Join(dir, "cgroup.kill"), []byte("1")) | ||
if err == nil || !errors.Is(err, fs.ErrNotExist) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given ENODEV appearing in certain scenarios, I wonder if it is possible for it to show up here as well. When looking at the kernel, could ENODEV be returned for every cgroup meta-file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I just checked and weirdly enough, No.
https://elixir.bootlin.com/linux/v6.11/source/kernel/cgroup/cgroup.c#L4030, but I think we should a check for good measure if kernel behavior decided to change later.
It is surprising to see ENOENT/ENODEV thrown everywhere like that without distinction in the kernel source code.
sandbox/cgroup/kill.go
Outdated
// pids read earlier. | ||
// Note: When cgroup v1 is detected, the call will also act on the freezer | ||
// group created when a snap process was started to address a known bug on | ||
// systemd v327 for non-root users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth mentioning that this is only useful for killing apps or processes which do not have their lifecycle managed by external entities like systemd
sandbox/cgroup/kill.go
Outdated
// Keep sending SIGKILL signals until no more pids are left in cgroup | ||
// to cover the case where a process forks before we kill it. | ||
for { | ||
// XXX: Should this have maximum retries? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we could have a spread test which essential does a fork bomb inside the snap, and with this code we should still arrive at a stable state when the cgorup is empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Great idea. I added a fork-bomb test variant. I crashed my machine twice making this 😅
needs a rebase now |
license: GPL-3.0 | ||
apps: | ||
fork-bomb: | ||
command: bin/bomb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add another app called sh
with a trivial script which does exec and then there'd be no need to install test-snapd-sh (which also pulls in core IIRC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, the test is leaner now. Thanks!
36bb459
to
55b74c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
sh_snap_bin="$(command -v fork-bomb.sh)" | ||
if [ "$BAD_SNAP" = "fork-bomb" ]; then | ||
#shellcheck disable=SC2016 | ||
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that it doesn't blow up too much in the tests:
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb' | |
systemd-run --unit test-kill.service -p TasksMax=1000 flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to only apply to the outer systemd-run, but snapd's systemd-run escapes the outer cgroup limits imposed by TasksMax.
Interestingly, I think we accidentally just discovered that snaps don't behave normally when run as part of systemd directly because snap run
calls systemd-run which create a separate unit/cgroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's unfortunate. Maybe this should be treated as a bug really, filed https://warthogs.atlassian.net/browse/SNAPDENG-32298 feel free to grab it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Some suggestions (and one open question), but looks good.
sandbox/cgroup/kill.go
Outdated
return err | ||
} | ||
return nil | ||
} | ||
|
||
var firstErr error | ||
skipError := func(err error) bool { | ||
if !errors.Is(err, fs.ErrNotExist) && firstErr == nil { | ||
// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were | ||
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went | |
// processing the cgroup. ENODEV is returned by the kernel if the cgroup went |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, thank you!
else | ||
#shellcheck disable=SC2016 | ||
systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; sleep 100000' | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this other variant have much value now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it, the normal variant tests that the feature works in the normal case and the other fork-bomb variant stresses a specific part of the code responsible for handling such processes. I like how the fork-bomb variant isolates where to look in case of a regression.
sandbox/cgroup/kill.go
Outdated
// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were | ||
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went | ||
// away while a kernfs operation is ongoing. | ||
if !errors.Is(err, fs.ErrNotExist) && !errors.Is(err, syscall.ENODEV) && firstErr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the open
man page:
ENODEV pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENXIO The file is a device special file and no corresponding device exists.
I wonder if we should handle ENXIO too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok as is. The error handling branch is tailored to handle the error cases for which the overall operation of killing processes can be considered to have succeeded. Specifically ENOENT and then to account for the cgroup v1 prededent ENODEV. I do not see any evidence of ENXIO use under kernel/cgroup in the kernel tree, so I think it's ok for the code to fail explicitly should it even occur.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14513 +/- ##
==========================================
+ Coverage 78.85% 78.88% +0.02%
==========================================
Files 1079 1080 +1
Lines 145615 145922 +307
==========================================
+ Hits 114828 115111 +283
- Misses 23601 23618 +17
- Partials 7186 7193 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
For the record, after a long brainstorm session I expect this to change a little so I will keep my review queue open and just revisit next week. |
I've re-requested Maciej's review since we expect changes and not to land this in current state. |
d8ea366
to
027bbb8
Compare
@ZeyadYasser please let me know when this switches from draft to ready to review. I'll postpone until then. |
Indeed, I am currently running experiments and letting spread do its thing. I will let you know when this is ready to review. Thank you! |
027bbb8
to
de1b840
Compare
9ec791b
to
3fc8dda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good. I cannot find any thing I would object to conceptually. I left a question and a small suggestion for a variable.
LGTM
thaw() | ||
} | ||
if firstErr != nil { | ||
return firstErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we ever encountered this in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, according the man 2 kill
it is unlikely because snapd is root
sandbox/cgroup/kill.go
Outdated
// - A bug in some kernel versions where sometimes a cgroup get stuck | ||
// in FREEZING state. Given that maxKillTimeout is bigger than timeout passed to freezer | ||
// This gives a chance to thaw the cgroup and trying again. | ||
ctxWithTimeout, cancel := context.WithTimeout(ctx, 1*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move the timeout to a variable somewhere up in the module for better visibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, one comment about an unclosed file.
lock, err := snaplock.OpenLock(snapName) | ||
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a defer lock.Close()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, Thank you!
unify termination algorithm for v1/v2 - for each snap cgroup: - while cgroup.procs is not empty: - SIGKILL each pid in cgroup.procs - for v1 only, also kill pids found in freezer cgroup created by snap-confine - this is relevant for systemd v237 (used in ubuntu 18.04) for non-root users where the transient scope cgroups are not created This logic drops the freeze/kill/thaw approach with all the weird v1/v2/kernel backward compatibility. Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
This test variant stress-tests the new algorithm where snapd could be racing after a fork bomb without doing freezing first by continuously killing pids that show up until all pids are drained from cgroup. Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
…nCgroup Signed-off-by: Zeyad Gouda <[email protected]>
…esses for v1 Signed-off-by: Zeyad Gouda <[email protected]>
This syncs snap-confine and this task to make sure they are not racing on two important resources: - Remove inhibition lock (which snap-confine exits when observing) - V1 freezer cgroup (which snap-confine creates and joins) This is needed to address an issue in systemd v237 (used by Ubuntu 18.04) for non-root users where no tracking transient scope cgroups are created except the freezer cgroup which is created in snap-confine after the inhibition lock is release by "snap run". Effectively the sequence below is followed: - kill-snap-apps task holds snap lock - kill-snap-apps holds remove inhibition lock - snap-confine holds snap lock - snap-confine exits if remove inhibition lock exists - snap-confine creates/joins freezer Signed-off-by: Zeyad Gouda <[email protected]>
…up v1 When sending SIGKILL signals to snap pids in a frozen v1 cgroup a thaw must be done for those signals to take effect. Signed-off-by: Zeyad Gouda <[email protected]>
…roying test machine The fork-bomb test variant was destroying test machines especially those with older systemd versions where DefaultTaskMax was unlimited. This runs the fork-bomb test variant under a separate user whose TasksMax is limited. Signed-off-by: Zeyad Gouda <[email protected]>
…ariant Amazon Linux 2 does not support systemd --user needed by the fork-bomb variant of the test. Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
f3bed7e
to
1ffcb7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nitpicks, otherwise LGTM
cmd/snap-confine/snap-confine.c
Outdated
|
||
// This is a workaround for systemd v237 (used by Ubuntu 18.04) for non-root users | ||
// where a transient scope cgroup is not created for a snap hence it cannot be tracked | ||
// before the freezer cgroup is created (and joind) below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// before the freezer cgroup is created (and joind) below. | |
// before the freezer cgroup is created (and joined) below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you!
sandbox/cgroup/kill.go
Outdated
// This is to address multiple edge cases: | ||
// (1) Hybrid v1/v2 cgroups with pids controller mounted only on v1 or v2 (Ubuntu 20.04) | ||
// so we cannot guarantee having pids.max so we use the freezer cgroup instead. | ||
// (2) Address a known bug on systemd v327 for non-root users where transient scopes are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's 237, isn't it?
// (2) Address a known bug on systemd v327 for non-root users where transient scopes are | |
// (2) Address a known bug on systemd v237 for non-root users where transient scopes are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, thank you!
sandbox/cgroup/kill.go
Outdated
} | ||
var firstErr error | ||
for _, pid := range pids { | ||
// This prevents a rouge fork bomb from keeping this loop running forever |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This prevents a rouge fork bomb from keeping this loop running forever | |
// This prevents a rogue fork bomb from keeping this loop running forever |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you!
Signed-off-by: Zeyad Gouda <[email protected]>
Failures:
fedora-os |
ubuntu-xenial-bionic |
ubuntu-core-20 |
ubuntu-arm |
|
Rerunning arm tests to prove failure is just due to flakiness... |
Current failing tests are known to fail and unrelated:
|
unify termination algorithm for v1/v2
Note: Terminating classic snap apps is best effort and cannot be guaranteed because a snap app can use systemd-run to break out of the cgroups used for tracking snap pids.
This PR should also fix flakey results seen in
tests/main/snap-remove-terminate
.