sandbox/cgroup: improve cgroup-based process termination algorithm #14513

ZeyadYasser · 2024-09-16T16:28:20Z

Unify the termination algorithm for cgroup v1 and v2.

For each snap cgroup:
- while cgroup.procs is not empty:
  - SIGKILL each pid in cgroup.procs

For v2 only:
- set pids.max to 0 to counter fork bombs (this is an alternative to cgroup freezing, which is not available on older kernels)

For v1 only:
- kill pids found in the freezer cgroup created by snap-confine (this is relevant for systemd v237, used in Ubuntu 18.04, where the transient scope cgroups are not created for non-root users)
- freeze/thaw using the freezer cgroup, because on hybrid systems we cannot guarantee that the pids controller is available on the cgroup version we are using

Note: Terminating classic snap apps is best effort and cannot be guaranteed because a snap app can use systemd-run to break out of the cgroups used for tracking snap pids.

This PR should also fix the flaky results seen in tests/main/snap-remove-terminate.

ZeyadYasser · 2024-09-16T20:24:59Z

Something is going wrong with the new approach. I need to investigate why reading cgroup.procs reports "no such device" (ENODEV) instead of "file does not exist" (ENOENT).

+ snap remove --terminate test-snapd-sh
error: cannot perform the following tasks:
- Kill running snap "test-snapd-sh" apps (read /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/app.slice/snap.test-snapd-sh.sh-0e68ddf6-d14f-4c3c-9359-94bf510a9b8d.scope/cgroup.procs: no such device)

ZeyadYasser · 2024-09-17T18:48:55Z

Found root cause of ENODEV:

https://github.com/torvalds/linux/blob/a940d9a43e623d1ba1e5c499aa843516656c0ae4/fs/kernfs/file.c#L592-L603 kernfs->kernfs_fop_open
https://github.com/torvalds/linux/blob/a940d9a43e623d1ba1e5c499aa843516656c0ae4/fs/kernfs/file.c#L145-L156 kernfs->kernfs_seq_start

This is kernfs file-operation behavior: the kernel checks that the file is still there before continuing. It should be safe to skip ENODEV errors, just as we skip fs.ErrNotExist.

ENODEV is triggered when systemd removes the cgroup while an open or read of cgroup.procs is still in progress in the kernel.

bboozzoo · 2024-09-18T12:37:50Z

sandbox/cgroup/kill.go

-//
-//  1. Cgroup v2 freezer was only available since Linux 5.2 so freezing is a no-op before 5.2 which allows processes to keep forking.
-//  2. Freezing does not put processes in an uninterruptable sleep unlike v1, so they can be killed externally and have their pid reused.
-//  3. `cgroup.kill` was introduced in Linux 5.14 and solves the above issues as it kills the cgroup processes atomically.
 func killSnapProcessesImplV2(ctx context.Context, snapName string) error {
 	killCgroupProcs := func(dir string) error {
 		// Use cgroup.kill if it exists (requires linux 5.14+)
 		err := writeExistingFile(filepath.Join(dir, "cgroup.kill"), []byte("1"))
 		if err == nil || !errors.Is(err, fs.ErrNotExist) {


Given that ENODEV appears in certain scenarios, I wonder whether it could show up here as well. Looking at the kernel, could ENODEV be returned for every cgroup meta-file?

Good point. I just checked and, weirdly enough, no:

https://elixir.bootlin.com/linux/v6.11/source/kernel/cgroup/cgroup.c#L4030

Still, I think we should add a check for good measure in case the kernel behavior changes later.

It is surprising to see ENOENT and ENODEV returned everywhere like that, without distinction, in the kernel source code.

bboozzoo · 2024-09-18T12:42:09Z

sandbox/cgroup/kill.go

-// pids read earlier.
+// Note: When cgroup v1 is detected, the call will also act on the freezer
+// group created when a snap process was started to address a known bug on
+// systemd v327 for non-root users.


Maybe worth mentioning that this is only useful for killing apps or processes whose lifecycle is not managed by an external entity such as systemd.

bboozzoo · 2024-09-18T12:46:51Z

sandbox/cgroup/kill.go

+	// Keep sending SIGKILL signals until no more pids are left in cgroup
+	// to cover the case where a process forks before we kill it.
+	for {
+		// XXX: Should this have maximum retries?


Perhaps we could have a spread test which essentially runs a fork bomb inside the snap; with this code we should still arrive at a stable state where the cgroup is empty.

Thanks, great idea. I added a fork-bomb test variant. I crashed my machine twice making this 😅

bboozzoo · 2024-09-19T07:38:00Z

needs a rebase now

bboozzoo · 2024-09-19T07:40:09Z

tests/lib/snaps/fork-bomb/meta/snap.yaml

+license: GPL-3.0
+apps:
+ fork-bomb:
+  command: bin/bomb


You can add another app called sh with a trivial script which does exec; then there would be no need to install test-snapd-sh (which also pulls in core, IIRC).

Nice, the test is leaner now. Thanks!

bboozzoo

LGTM

bboozzoo · 2024-09-19T12:38:15Z

tests/main/snap-remove-terminate/task.yaml

+    sh_snap_bin="$(command -v fork-bomb.sh)"
+    if [ "$BAD_SNAP" = "fork-bomb" ]; then
+        #shellcheck disable=SC2016
+        systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'


so that it doesn't blow up too much in the tests:

Suggested change

systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'

systemd-run --unit test-kill.service -p TasksMax=1000 flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'

It seems to apply only to the outer systemd-run; snapd's own systemd-run escapes the outer cgroup limits imposed by TasksMax.

Interestingly, I think we just accidentally discovered that snaps don't behave normally when run directly as part of a systemd unit, because snap run calls systemd-run, which creates a separate unit/cgroup.

Yeah, that's unfortunate. Maybe this should really be treated as a bug. Filed https://warthogs.atlassian.net/browse/SNAPDENG-32298, feel free to grab it.

andrewphelpsj

Thanks! Some suggestions (and one open question), but looks good.

andrewphelpsj · 2024-09-19T13:23:32Z

sandbox/cgroup/kill.go

 			return err
 		}
 		return nil
 	}

 	var firstErr error
 	skipError := func(err error) bool {
-		if !errors.Is(err, fs.ErrNotExist) && firstErr == nil {
+		// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
+		// processing the cgorup. ENODEV is returned by the kernel if the cgroup went


Suggested change

// processing the cgorup. ENODEV is returned by the kernel if the cgroup went

// processing the cgroup. ENODEV is returned by the kernel if the cgroup went

good catch, thank you!

andrewphelpsj · 2024-09-19T13:28:47Z

tests/main/snap-remove-terminate/task.yaml

+    else
+        #shellcheck disable=SC2016
+        systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; sleep 100000'
+    fi


Does this other variant have much value now?

The way I see it, the normal variant tests that the feature works in the normal case and the other fork-bomb variant stresses a specific part of the code responsible for handling such processes. I like how the fork-bomb variant isolates where to look in case of a regression.

andrewphelpsj · 2024-09-19T13:51:38Z

sandbox/cgroup/kill.go

+		// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
+		// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
+		// away while a kernfs operation is ongoing.
+		if !errors.Is(err, fs.ErrNotExist) && !errors.Is(err, syscall.ENODEV) && firstErr == nil {


From the open(2) man page:

ENODEV: pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENXIO: The file is a device special file and no corresponding device exists.

I wonder if we should handle ENXIO too?

I think it's OK as is. The error handling branch is tailored to the error cases in which the overall operation of killing processes can be considered to have succeeded: specifically ENOENT and, to account for the cgroup v1 precedent, ENODEV. I do not see any evidence of ENXIO use under kernel/cgroup in the kernel tree, so I think it's OK for the code to fail explicitly should it ever occur.

codecov · 2024-09-19T19:20:50Z

Codecov Report

Attention: Patch coverage is 89.71963% with 11 lines in your changes missing coverage. Please review.

Project coverage is 78.88%. Comparing base (ac897ee) to head (9454158).
Report is 44 commits behind head on master.

Files with missing lines	Patch %	Lines
sandbox/cgroup/kill.go	91.01%	4 Missing and 4 partials ⚠️
overlord/snapstate/handlers.go	81.25%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #14513      +/-   ##
==========================================
+ Coverage   78.85%   78.88%   +0.02%     
==========================================
  Files        1079     1080       +1     
  Lines      145615   145922     +307     
==========================================
+ Hits       114828   115111     +283     
- Misses      23601    23618      +17     
- Partials     7186     7193       +7

Flag	Coverage Δ
unittests	`78.88% <89.71%> (+0.02%)`	⬆️

zyga · 2024-09-20T10:35:19Z

For the record, after a long brainstorm session I expect this to change a little so I will keep my review queue open and just revisit next week.

zyga · 2024-09-20T13:44:48Z

I've re-requested Maciej's review since we expect changes and not to land this in current state.

zyga · 2024-09-25T10:56:53Z

@ZeyadYasser please let me know when this switches from draft to ready to review. I'll postpone until then.

ZeyadYasser · 2024-09-25T11:02:13Z

Indeed, I am currently running experiments and letting spread do its thing. I will let you know when this is ready to review. Thank you!

zyga

I think this looks good. I cannot find anything I would object to conceptually. I left a question and a small suggestion about a variable.

LGTM

zyga · 2024-09-27T08:31:25Z

sandbox/cgroup/kill.go

+			thaw()
+		}
+		if firstErr != nil {
+			return firstErr


Have we ever encountered this in practice?

No. According to kill(2), it is unlikely because snapd runs as root.

zyga · 2024-09-27T08:32:15Z

sandbox/cgroup/kill.go

+		//   - A bug in some kernel versions where sometimes a cgroup get stuck
+		//     in FREEZING state. Given that maxKillTimeout is bigger than timeout passed to freezer
+		//     This gives a chance to thaw the cgroup and trying again.
+		ctxWithTimeout, cancel := context.WithTimeout(ctx, 1*time.Second)


I would move the timeout to a variable somewhere up in the module for better visibility.

andrewphelpsj

Looks good, one comment about an unclosed file.

andrewphelpsj · 2024-09-27T17:42:57Z

overlord/snapstate/handlers.go

+	lock, err := snaplock.OpenLock(snapName)
+	if err != nil {
+		return err
+	}


Missing a defer lock.Close() here?

Great catch, Thank you!

unify termination algorithm for v1/v2

- for each snap cgroup:
  - while cgroup.procs is not empty:
    - SIGKILL each pid in cgroup.procs
- for v1 only, also kill pids found in freezer cgroup created by snap-confine
  - this is relevant for systemd v237 (used in ubuntu 18.04) for non-root users where the transient scope cgroups are not created

This logic drops the freeze/kill/thaw approach with all the weird v1/v2/kernel backward compatibility.

Signed-off-by: Zeyad Gouda <[email protected]>

Signed-off-by: Zeyad Gouda <[email protected]>

This test variant stress-tests the new algorithm where snapd could be racing after a fork bomb without doing freezing first by continuously killing pids that show up until all pids are drained from cgroup. Signed-off-by: Zeyad Gouda <[email protected]>

Signed-off-by: Zeyad Gouda <[email protected]>

…nCgroup Signed-off-by: Zeyad Gouda <[email protected]>

…esses for v1 Signed-off-by: Zeyad Gouda <[email protected]>

This syncs snap-confine and this task to make sure they are not racing on two important resources:
- Remove inhibition lock (which snap-confine exits when observing)
- V1 freezer cgroup (which snap-confine creates and joins)

This is needed to address an issue in systemd v237 (used by Ubuntu 18.04) for non-root users, where no tracking transient scope cgroups are created except the freezer cgroup, which is created in snap-confine after the inhibition lock is released by "snap run".

Effectively the sequence below is followed:
- kill-snap-apps task holds snap lock
- kill-snap-apps holds remove inhibition lock
- snap-confine holds snap lock
- snap-confine exits if remove inhibition lock exists
- snap-confine creates/joins freezer

Signed-off-by: Zeyad Gouda <[email protected]>

…up v1 When sending SIGKILL signals to snap pids in a frozen v1 cgroup a thaw must be done for those signals to take effect. Signed-off-by: Zeyad Gouda <[email protected]>

…roying test machine The fork-bomb test variant was destroying test machines, especially those with older systemd versions where DefaultTasksMax was unlimited. This runs the fork-bomb test variant under a separate user whose TasksMax is limited. Signed-off-by: Zeyad Gouda <[email protected]>

…ariant Amazon Linux 2 does not support systemd --user needed by the fork-bomb variant of the test. Signed-off-by: Zeyad Gouda <[email protected]>

Signed-off-by: Zeyad Gouda <[email protected]>

bboozzoo

Just some nitpicks, otherwise LGTM

bboozzoo · 2024-09-30T05:40:11Z

cmd/snap-confine/snap-confine.c

+
+	// This is a workaround for systemd v237 (used by Ubuntu 18.04) for non-root users
+	// where a transient scope cgroup is not created for a snap hence it cannot be tracked
+	// before the freezer cgroup is created (and joind) below.


Suggested change

// before the freezer cgroup is created (and joind) below.

// before the freezer cgroup is created (and joined) below.

updated, thank you!

bboozzoo · 2024-09-30T05:43:16Z

sandbox/cgroup/kill.go

+//     This is to address multiple edge cases:
+//     (1) Hybrid v1/v2 cgroups with pids controller mounted only on v1 or v2 (Ubuntu 20.04)
+//     so we cannot guarantee having pids.max so we use the freezer cgroup instead.
+//     (2) Address a known bug on systemd v327 for non-root users where transient scopes are


it's 237, isn't it?

Suggested change

// (2) Address a known bug on systemd v327 for non-root users where transient scopes are

// (2) Address a known bug on systemd v237 for non-root users where transient scopes are

nice catch, thank you!

bboozzoo · 2024-09-30T05:44:50Z

sandbox/cgroup/kill.go

+		}
+		var firstErr error
+		for _, pid := range pids {
+			// This prevents a rouge fork bomb from keeping this loop running forever


Suggested change

// This prevents a rouge fork bomb from keeping this loop running forever

// This prevents a rogue fork bomb from keeping this loop running forever

updated, thank you!

Signed-off-by: Zeyad Gouda <[email protected]>

ernestl · 2024-09-30T09:29:25Z

Failures:

- debian-not-req
  - google-distro-1:debian-sid-64:tests/main/snap-run | unrelated
- fedora-os
  - openstack:fedora-40-64:tests/main/selinux-clean | unrelated
- ubuntu-xenial-bionic
  - google:ubuntu-18.04-64:tests/main/snap-user-service-start-on-install | unrelated
- ubuntu-core-20
  - google-core:ubuntu-core-20-64:tests/regression/mount-order-regression | known flaky, unrelated
- ubuntu-arm
  - google-arm:ubuntu-20.04-arm-64:tests/unit/go:gcc

2024-09-30T08:21:12.4637313Z FAIL: snapstate_update_test.go:1607: snapmgrTestSuite.TestUpdateWithNewDefaultProvider
2024-09-30T08:21:12.4638118Z
2024-09-30T08:21:12.4638420Z snapstate_update_test.go:1634:
2024-09-30T08:21:12.4639101Z s.settle(c)
2024-09-30T08:21:12.4639700Z snapstate_test.go:108:
2024-09-30T08:21:12.4640319Z c.Error(err)
2024-09-30T08:21:12.4641142Z ... Error: Settle is not converging

ernestl · 2024-09-30T09:40:53Z

Rerunning arm tests to prove failure is just due to flakiness...

ZeyadYasser · 2024-10-01T06:26:26Z

Current failing tests are known to fail and unrelated:

google-distro-1:debian-sid-64:tests/main/snap-run
openstack:fedora-40-64:tests/main/selinux-clean
google:ubuntu-18.04-64:tests/main/snap-user-service-start-on-install
google-core:ubuntu-core-20-64:tests/regression/mount-order-regression

ZeyadYasser requested review from zyga and andrewphelpsj September 16, 2024 16:28

ZeyadYasser added this to the 2.66 milestone Sep 16, 2024

ZeyadYasser requested a review from bboozzoo September 16, 2024 16:29

ZeyadYasser added the ⛔ Blocked label Sep 17, 2024

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from c6a5813 to 371e0eb Compare September 17, 2024 18:44

ZeyadYasser removed the ⛔ Blocked label Sep 17, 2024

bboozzoo reviewed Sep 18, 2024

ZeyadYasser requested a review from bboozzoo September 18, 2024 20:44

bboozzoo reviewed Sep 19, 2024

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from 36bb459 to 55b74c3 Compare September 19, 2024 12:00

ZeyadYasser requested a review from bboozzoo September 19, 2024 12:34

bboozzoo approved these changes Sep 19, 2024

andrewphelpsj approved these changes Sep 19, 2024

zyga requested a review from bboozzoo September 20, 2024 13:44

ZeyadYasser marked this pull request as draft September 24, 2024 14:58

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch 5 times, most recently from d8ea366 to 027bbb8 Compare September 25, 2024 10:52

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from 027bbb8 to de1b840 Compare September 25, 2024 21:42

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch 4 times, most recently from 9ec791b to 3fc8dda Compare September 26, 2024 12:07

ZeyadYasser marked this pull request as ready for review September 26, 2024 12:16

zyga approved these changes Sep 27, 2024

ernestl requested a review from andrewphelpsj September 27, 2024 13:38

andrewphelpsj approved these changes Sep 27, 2024

ZeyadYasser added 14 commits September 29, 2024 19:50

sandbox/cgroup: address review comments

8e24620

Signed-off-by: Zeyad Gouda <[email protected]>

tests: address review comments

1f0f275

Signed-off-by: Zeyad Gouda <[email protected]>

many: address review comments

f0012ce

Signed-off-by: Zeyad Gouda <[email protected]>

tests: fix wrong binary path

c4cfbd7

Signed-off-by: Zeyad Gouda <[email protected]>

sandbox/cgroup: address fork bombs in KillSnapProcesses

8013083

Signed-off-by: Zeyad Gouda <[email protected]>

sandbox/cgroup: add context propagation and timeout to killProcessesI…

310f356

…nCgroup Signed-off-by: Zeyad Gouda <[email protected]>

sandbox/cgroup: don't use freezer cgroup for tracking in KillSnapProc…

37c8531

…esses for v1 Signed-off-by: Zeyad Gouda <[email protected]>

sandbox/cgroup: freeze/thaw per cgroup when killing snap apps on cgor…

b218ec5

…up v1 When sending SIGKILL signals to snap pids in a frozen v1 cgroup a thaw must be done for those signals to take effect. Signed-off-by: Zeyad Gouda <[email protected]>

tests/main/snap-remove-terminate: skip amazon-linux-2 for fork-bomb v…

0006bc5

…ariant Amazon Linux 2 does not support systemd --user needed by the fork-bomb variant of the test. Signed-off-by: Zeyad Gouda <[email protected]>

many: address review comments

1ffcb7c

Signed-off-by: Zeyad Gouda <[email protected]>

ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from f3bed7e to 1ffcb7c Compare September 29, 2024 16:50

bboozzoo approved these changes Sep 30, 2024

many: fix comment typos

9454158

Signed-off-by: Zeyad Gouda <[email protected]>

Meulengracht merged commit e9b341e into canonical:master Oct 1, 2024
50 of 54 checks passed
