
sandbox/cgroup: improve cgroup-based process termination algorithm #14513

Conversation

@ZeyadYasser (Contributor) commented Sep 16, 2024

unify termination algorithm for v1/v2

  • for each snap cgroup:
    • while cgroup.procs is not empty:
      • SIGKILL each pid in cgroup.procs
  • for v2 only
    • set pids.max to 0 to counter fork bombs (this is an alternative to cgroup freezing, which is not available on older kernels)
  • for v1 only
    • kill pids found in the freezer cgroup created by snap-confine (this is relevant for systemd v237, used in Ubuntu 18.04, where no transient scope cgroups are created for non-root users)
    • freeze/thaw using the freezer cgroup, because on hybrid systems we cannot guarantee that the pids controller is available on the cgroup version we are using

A sketch of this loop is included at the end of this comment.

Note: Terminating classic snap apps is best effort and cannot be guaranteed, because a snap app can use systemd-run to break out of the cgroups used for tracking snap pids.

This PR should also fix the flaky results seen in tests/main/snap-remove-terminate.
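A minimal sketch of the loop described above, using hypothetical names (killCgroup, the example scope path) rather than the PR's actual helpers in sandbox/cgroup:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// killCgroup repeatedly SIGKILLs every pid listed in cgroup.procs until the
// cgroup is drained, covering processes that fork between reads.
func killCgroup(dir string, isV2 bool) error {
	if isV2 {
		// On v2, clamp pids.max to 0 so a fork bomb cannot outpace the loop
		// (an alternative to freezing, which needs a 5.2+ kernel).
		if err := os.WriteFile(filepath.Join(dir, "pids.max"), []byte("0"), 0o644); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	for {
		data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
		if err != nil {
			if os.IsNotExist(err) {
				return nil // cgroup already removed, nothing left to kill
			}
			return err
		}
		fields := strings.Fields(string(data))
		if len(fields) == 0 {
			return nil // cgroup drained
		}
		for _, f := range fields {
			pid, err := strconv.Atoi(f)
			if err != nil {
				continue
			}
			// ESRCH only means the process exited between the read and the kill.
			if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
				return err
			}
		}
	}
}

func main() {
	// Example invocation with an illustrative scope path.
	if err := killCgroup("/sys/fs/cgroup/system.slice/snap.example.app.scope", true); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}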

@ZeyadYasser ZeyadYasser added this to the 2.66 milestone Sep 16, 2024
@ZeyadYasser (Contributor Author)

Something is going wrong in the new approach; I need to investigate why cgroup.procs reports "no such device" instead of "file does not exist".

+ snap remove --terminate test-snapd-sh
error: cannot perform the following tasks:
- Kill running snap "test-snapd-sh" apps (read /sys/fs/cgroup/user.slice/user-0.slice/[email protected]/app.slice/snap.test-snapd-sh.sh-0e68ddf6-d14f-4c3c-9359-94bf510a9b8d.scope/cgroup.procs: no such device)

@ZeyadYasser (Contributor Author)

Found the root cause of ENODEV:

It is kernfs file-operation behavior: the kernel checks that the file is still there before continuing, so it should be safe to skip ENODEV the same way we skip fs.ErrNotExist.

ENODEV is triggered when we start opening/reading cgroup.procs and systemd removes the cgroup at the same time, while the kernel operation is still in flight.
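A minimal sketch of the tolerant read this implies; readCgroupProcs is a hypothetical helper, not the PR's exact code:

package cgroup

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// readCgroupProcs reads cgroup.procs, treating a concurrently removed cgroup as
// empty: ENOENT if it vanished before the open, ENODEV if kernfs noticed the
// removal while the read was in flight.
func readCgroupProcs(dir string) ([]byte, error) {
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	if errors.Is(err, fs.ErrNotExist) || errors.Is(err, syscall.ENODEV) {
		return nil, nil
	}
	return data, err
}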

//
// 1. The cgroup v2 freezer was only available since Linux 5.2, so freezing is a no-op before 5.2, which allows processes to keep forking.
// 2. Freezing does not put processes in an uninterruptible sleep unlike v1, so they can be killed externally and have their pid reused.
// 3. `cgroup.kill` was introduced in Linux 5.14 and solves the above issues as it kills the cgroup processes atomically.
func killSnapProcessesImplV2(ctx context.Context, snapName string) error {
	killCgroupProcs := func(dir string) error {
		// Use cgroup.kill if it exists (requires linux 5.14+)
		err := writeExistingFile(filepath.Join(dir, "cgroup.kill"), []byte("1"))
		if err == nil || !errors.Is(err, fs.ErrNotExist) {
Contributor:

given ENODEV appearing in certain scenarios, I wonder if it is possible for it to show up here as well. When looking at the kernel, could ENODEV be returned for every cgroup meta-file?

Contributor Author:

Good point. I just checked and, weirdly enough, no.

https://elixir.bootlin.com/linux/v6.11/source/kernel/cgroup/cgroup.c#L4030, but I think we should add a check for good measure, in case the kernel behavior changes later.

It is surprising to see ENOENT/ENODEV thrown everywhere like that without distinction in the kernel source code.

// pids read earlier.
// Note: When cgroup v1 is detected, the call will also act on the freezer
// group created when a snap process was started to address a known bug on
// systemd v237 for non-root users.
Contributor:

maybe worth mentioning that this is only useful for killing apps or processes which do not have their lifecycle managed by external entities like systemd

// Keep sending SIGKILL signals until no more pids are left in cgroup
// to cover the case where a process forks before we kill it.
for {
	// XXX: Should this have maximum retries?
Contributor:

Perhaps we could have a spread test which essentially does a fork bomb inside the snap; with this code we should still arrive at a stable state once the cgroup is empty.

Contributor Author:

Thanks, great idea. I added a fork-bomb test variant. I crashed my machine twice making this 😅

@bboozzoo (Contributor)

needs a rebase now

tests/lib/snaps/fork-bomb/meta/snap.yaml
license: GPL-3.0
apps:
  fork-bomb:
    command: bin/bomb
Contributor:

you can add another app called sh with a trivial script which does exec and then there'd be no need to install test-snapd-sh (which also pulls in core IIRC)

Contributor Author:

Nice, the test is leaner now. Thanks!

@ZeyadYasser ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from 36bb459 to 55b74c3 on September 19, 2024 12:00
@bboozzoo (Contributor) left a comment

LGTM

sh_snap_bin="$(command -v fork-bomb.sh)"
if [ "$BAD_SNAP" = "fork-bomb" ]; then
    #shellcheck disable=SC2016
    systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
Contributor:

so that it doesn't blow up too much in the tests:

Suggested change:
- systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'
+ systemd-run --unit test-kill.service -p TasksMax=1000 flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; $SNAP/bin/fork-bomb'

Contributor Author:

It seems to apply only to the outer systemd-run, but snapd's own systemd-run escapes the outer cgroup limits imposed by TasksMax.

Interestingly, I think we accidentally just discovered that snaps don't behave normally when run directly as part of systemd, because snap run calls systemd-run, which creates a separate unit/cgroup.

Contributor:

Yeah, that's unfortunate. Maybe this should really be treated as a bug; I filed https://warthogs.atlassian.net/browse/SNAPDENG-32298, feel free to grab it.

@andrewphelpsj (Member) left a comment

Thanks! Some suggestions (and one open question), but looks good.

		return err
	}
	return nil
}

var firstErr error
skipError := func(err error) bool {
	if !errors.Is(err, fs.ErrNotExist) && firstErr == nil {
		// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
		// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
Member:

Suggested change:
- // processing the cgorup. ENODEV is returned by the kernel if the cgroup went
+ // processing the cgroup. ENODEV is returned by the kernel if the cgroup went

Contributor Author:

good catch, thank you!

Comment on lines 34 to 59
else
    #shellcheck disable=SC2016
    systemd-run --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch /var/snap/test-snapd-sh/common/alive; sleep 100000'
fi
Member:

Does this other variant have much value now?

Contributor Author:

The way I see it, the normal variant tests that the feature works in the normal case and the other fork-bomb variant stresses a specific part of the code responsible for handling such processes. I like how the fork-bomb variant isolates where to look in case of a regression.

// fs.ErrNotExist and ENODEV are ignored in case the cgroup went away while we were
// processing the cgorup. ENODEV is returned by the kernel if the cgroup went
// away while a kernfs operation is ongoing.
if !errors.Is(err, fs.ErrNotExist) && !errors.Is(err, syscall.ENODEV) && firstErr == nil {
Member:

From the open man page:

ENODEV pathname refers to a device special file and no corresponding device exists.  (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENXIO  The file is a device special file and no corresponding device exists.

I wonder if we should handle ENXIO too?

Contributor:

I think it's ok as is. The error-handling branch is tailored to handle the error cases for which the overall operation of killing processes can be considered to have succeeded: specifically ENOENT and then, to account for the cgroup v1 precedent, ENODEV. I do not see any evidence of ENXIO use under kernel/cgroup in the kernel tree, so I think it's ok for the code to fail explicitly should it ever occur.

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 89.71963% with 11 lines in your changes missing coverage. Please review.

Project coverage is 78.88%. Comparing base (ac897ee) to head (9454158).
Report is 44 commits behind head on master.

Files with missing lines Patch % Lines
sandbox/cgroup/kill.go 91.01% 4 Missing and 4 partials ⚠️
overlord/snapstate/handlers.go 81.25% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14513      +/-   ##
==========================================
+ Coverage   78.85%   78.88%   +0.02%     
==========================================
  Files        1079     1080       +1     
  Lines      145615   145922     +307     
==========================================
+ Hits       114828   115111     +283     
- Misses      23601    23618      +17     
- Partials     7186     7193       +7     
Flag Coverage Δ
unittests 78.88% <89.71%> (+0.02%) ⬆️


@zyga (Contributor) commented Sep 20, 2024

For the record, after a long brainstorm session I expect this to change a little so I will keep my review queue open and just revisit next week.

@zyga zyga requested a review from bboozzoo September 20, 2024 13:44
@zyga (Contributor) commented Sep 20, 2024

I've re-requested Maciej's review since we expect changes and do not plan to land this in its current state.

@ZeyadYasser ZeyadYasser marked this pull request as draft September 24, 2024 14:58
@ZeyadYasser ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch 5 times, most recently from d8ea366 to 027bbb8 on September 25, 2024 10:52
@zyga (Contributor) commented Sep 25, 2024

@ZeyadYasser please let me know when this switches from draft to ready to review. I'll postpone until then.

@ZeyadYasser (Contributor Author)

Indeed, I am currently running experiments and letting spread do its thing. I will let you know when this is ready to review. Thank you!

@ZeyadYasser ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from 027bbb8 to de1b840 on September 25, 2024 21:42
@ZeyadYasser ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch 4 times, most recently from 9ec791b to 3fc8dda on September 26, 2024 12:07
@ZeyadYasser ZeyadYasser marked this pull request as ready for review September 26, 2024 12:16
@zyga (Contributor) left a comment

I think this looks good. I cannot find anything I would object to conceptually. I left a question and a small suggestion for a variable.

LGTM

	thaw()
}
if firstErr != nil {
	return firstErr
Contributor:

Have we ever encountered this in practice?

Contributor Author:

No; according to man 2 kill it is unlikely, because snapd is root.

// - A bug in some kernel versions where sometimes a cgroup gets stuck
//   in the FREEZING state. Given that maxKillTimeout is bigger than the timeout passed
//   to the freezer, this gives a chance to thaw the cgroup and try again.
ctxWithTimeout, cancel := context.WithTimeout(ctx, 1*time.Second)
Contributor:

I would move the timeout to a variable somewhere up in the module for better visibility.
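A sketch of what that could look like, with an illustrative package-level variable and hypothetical freeze/thaw callbacks rather than the PR's actual identifiers:

package cgroup

import (
	"context"
	"time"
)

// freezeTimeout bounds a single freeze attempt; hoisting it to package level
// makes the value easy to find and to shorten in unit tests.
var freezeTimeout = 1 * time.Second

// freezeWithRetry retries freezing until it succeeds or the outer context
// (bounded by the larger kill timeout) expires, thawing between attempts so a
// cgroup stuck in FREEZING on buggy kernels gets another chance.
func freezeWithRetry(ctx context.Context, freeze func(context.Context) error, thaw func()) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, freezeTimeout)
		err := freeze(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		thaw()
		if ctx.Err() != nil {
			return ctx.Err()
		}
	}
}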

@andrewphelpsj (Member) left a comment

Looks good, one comment about an unclosed file.

lock, err := snaplock.OpenLock(snapName)
if err != nil {
	return err
}
Member:

Missing a defer lock.Close() here?

Contributor Author:

Great catch, Thank you!
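For reference, a fragment sketching the agreed fix, using only what the diff above shows (snaplock.OpenLock and lock.Close); the surrounding function is elided:

lock, err := snaplock.OpenLock(snapName)
if err != nil {
	return err
}
// Release the lock's file descriptor on every return path out of the function.
defer lock.Close()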

unify termination algorithm for v1/v2
- for each snap cgroup:
  - while cgroup.procs is not empty:
    - SIGKILL each pid in cgroup.procs
- for v1 only, also kill pids found in the freezer cgroup created by snap-confine
  - this is relevant for systemd v237 (used in Ubuntu 18.04) for non-root users, where the transient scope cgroups are not created

This logic drops the freeze/kill/thaw approach, along with all of its awkward v1/v2/kernel backward-compatibility handling.

Signed-off-by: Zeyad Gouda <[email protected]>
This test variant stress-tests the new algorithm, where snapd could be racing against a fork bomb without freezing first, by continuously killing pids that show up until all pids are drained from the cgroup.

Signed-off-by: Zeyad Gouda <[email protected]>
Signed-off-by: Zeyad Gouda <[email protected]>
This syncs snap-confine and this task to make sure they are not racing
on two important resources:
   - Remove inhibition lock (which snap-confine exits when observing)
   - V1 freezer cgroup (which snap-confine creates and joins)

This is needed to address an issue in systemd v237 (used by Ubuntu 18.04) for non-root users, where no tracking transient scope cgroups are created except the freezer cgroup, which is created in snap-confine after the inhibition lock is released by "snap run".

Effectively the sequence below is followed:
- kill-snap-apps task holds snap lock
- kill-snap-apps holds remove inhibition lock
- snap-confine holds snap lock
- snap-confine exits if remove inhibition lock exists
- snap-confine creates/joins freezer

Signed-off-by: Zeyad Gouda <[email protected]>
…up v1

When sending SIGKILL signals to snap pids in a frozen v1 cgroup a thaw must
be done for those signals to take effect.

Signed-off-by: Zeyad Gouda <[email protected]>
…roying test machine

The fork-bomb test variant was destroying test machines, especially those with older systemd versions where DefaultTasksMax was unlimited.

This runs the fork-bomb test variant under a separate user whose TasksMax is
limited.

Signed-off-by: Zeyad Gouda <[email protected]>
…ariant

Amazon Linux 2 does not support systemd --user needed by the
fork-bomb variant of the test.

Signed-off-by: Zeyad Gouda <[email protected]>
@ZeyadYasser ZeyadYasser force-pushed the improve-cgroup-based-process-termination branch from f3bed7e to 1ffcb7c on September 29, 2024 16:50
@bboozzoo (Contributor) left a comment

Just some nitpicks, otherwise LGTM


// This is a workaround for systemd v237 (used by Ubuntu 18.04) for non-root users
// where a transient scope cgroup is not created for a snap hence it cannot be tracked
// before the freezer cgroup is created (and joind) below.
Contributor:

Suggested change:
- // before the freezer cgroup is created (and joind) below.
+ // before the freezer cgroup is created (and joined) below.

Contributor Author:

updated, thank you!

// This is to address multiple edge cases:
// (1) Hybrid v1/v2 cgroups with pids controller mounted only on v1 or v2 (Ubuntu 20.04)
// so we cannot guarantee having pids.max so we use the freezer cgroup instead.
// (2) Address a known bug on systemd v327 for non-root users where transient scopes are
Contributor:

it's 237, isn't it?

Suggested change:
- // (2) Address a known bug on systemd v327 for non-root users where transient scopes are
+ // (2) Address a known bug on systemd v237 for non-root users where transient scopes are

Contributor Author:

nice catch, thank you!

}
var firstErr error
for _, pid := range pids {
	// This prevents a rouge fork bomb from keeping this loop running forever
Contributor:

Suggested change:
- // This prevents a rouge fork bomb from keeping this loop running forever
+ // This prevents a rogue fork bomb from keeping this loop running forever

Contributor Author:

updated, thank you!

Signed-off-by: Zeyad Gouda <[email protected]>
@ernestl (Collaborator) commented Sep 30, 2024

Failures:
debian-not-req |

  • google-distro-1:debian-sid-64:tests/main/snap-run | unrelated

fedora-os |

  • openstack:fedora-40-64:tests/main/selinux-clean | unrelated

ubuntu-xenial-bionic |

  • google:ubuntu-18.04-64:tests/main/snap-user-service-start-on-install | unrelated

ubuntu-core-20 |

  • google-core:ubuntu-core-20-64:tests/regression/mount-order-regression | known flaky, unrelated

ubuntu-arm |

  • google-arm:ubuntu-20.04-arm-64:tests/unit/go:gcc
    2024-09-30T08:21:12.4637313Z FAIL: snapstate_update_test.go:1607:
    snapmgrTestSuite.TestUpdateWithNewDefaultProvider
    2024-09-30T08:21:12.4638118Z
    2024-09-30T08:21:12.4638420Z snapstate_update_test.go:1634:
    2024-09-30T08:21:12.4639101Z s.settle(c)
    2024-09-30T08:21:12.4639700Z snapstate_test.go:108:
    2024-09-30T08:21:12.4640319Z c.Error(err)
    2024-09-30T08:21:12.4641142Z ... Error: Settle is not converging

@ernestl (Collaborator) commented Sep 30, 2024

Rerunning arm tests to prove failure is just due to flakiness...

@ZeyadYasser (Contributor Author)

Current failing tests are known to fail and unrelated:

  • google-distro-1:debian-sid-64:tests/main/snap-run
  • openstack:fedora-40-64:tests/main/selinux-clean
  • google:ubuntu-18.04-64:tests/main/snap-user-service-start-on-install
  • google-core:ubuntu-core-20-64:tests/regression/mount-order-regression

@Meulengracht Meulengracht merged commit e9b341e into canonical:master Oct 1, 2024
50 of 54 checks passed