Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oom detection #365

Merged
merged 4 commits into from
May 10, 2022
Merged

oom detection #365

merged 4 commits into from
May 10, 2022

Conversation

rphillips
Copy link
Collaborator

@rphillips rphillips commented Apr 28, 2022

Added support for OOM detection. Need to figure out how to test it.

  • cgroup v2
  • cgroup v1
  • add oom exit paths

Fixes #297

@codecov-commenter
Copy link

codecov-commenter commented Apr 28, 2022

Codecov Report

Merging #365 (5d30fbf) into main (7bcb7e2) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##            main    #365   +/-   ##
=====================================
  Coverage   6.89%   6.89%           
=====================================
  Files          2       2           
  Lines         29      29           
  Branches      17      17           
=====================================
  Hits           2       2           
  Misses        27      27           

@rphillips rphillips force-pushed the oom_detection branch 4 times, most recently from 4203d05 to 86dd940 Compare April 28, 2022 19:03
@rphillips rphillips changed the title WIP: oom detection oom detection Apr 28, 2022
@rphillips rphillips force-pushed the oom_detection branch 2 times, most recently from 2910d88 to 8e5391e Compare April 28, 2022 20:23
@rphillips rphillips changed the title oom detection WIP: oom detection Apr 28, 2022
@rphillips
Copy link
Collaborator Author

going to fix the cgroup v1 flake.

@rphillips rphillips force-pushed the oom_detection branch 3 times, most recently from 410ebb7 to 736a291 Compare April 28, 2022 21:28
@rphillips rphillips changed the title WIP: oom detection oom detection Apr 28, 2022
@rphillips
Copy link
Collaborator Author

Fixed... I had to import the tokio-eventfd library to work with cgroupv1 events.

@rphillips
Copy link
Collaborator Author

Going to integrate a test and figure out cgroup v1... Something is still wrong.

@rphillips rphillips changed the title oom detection WIP: oom detection Apr 29, 2022
conmon-rs/server/src/child_reaper.rs Outdated Show resolved Hide resolved
conmon-rs/server/src/oom_watcher.rs Outdated Show resolved Hide resolved
conmon-rs/server/src/oom_watcher.rs Outdated Show resolved Hide resolved
conmon-rs/server/src/oom_watcher.rs Outdated Show resolved Hide resolved
@rphillips rphillips force-pushed the oom_detection branch 6 times, most recently from 9e39623 to a81b77a Compare April 29, 2022 15:31
@rphillips rphillips changed the title WIP: oom detection oom detection Apr 29, 2022
@rphillips
Copy link
Collaborator Author

I added all the requested fixes. We will need much more scaffolding to test OOMs within conmon-rs. I think we should rely on crio's test for now and add a TODO to natively test OOMs with this repo.

Copy link
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two nits, otherwise LGTM :)

conmon-rs/server/src/oom_watcher.rs Outdated Show resolved Hide resolved
conmon-rs/server/src/child_reaper.rs Outdated Show resolved Hide resolved
@rphillips rphillips force-pushed the oom_detection branch 4 times, most recently from 95b2413 to 6d42b79 Compare May 2, 2022 18:21
@rphillips
Copy link
Collaborator Author

@saschagrunert can you try this on a cgroupv1 machine and see if it fails for you?

@@ -58,6 +58,7 @@ var _ = Describe("ConmonClient", func() {
})

It(testName("should write exit file", terminal), func() {
Skip("")
Copy link
Member

@saschagrunert saschagrunert May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be removed, right? :)

If I re-enable the test then the Expect(fileContents(tr.exitPath())).To(Equal("0")) does fail below.

@saschagrunert
Copy link
Member

@saschagrunert can you try this on a cgroupv1 machine and see if it fails for you?

Hm, on Ubuntu 18.04 the setup seems to work:

root@ubuntu1804:~/conmon-rs# ginkgo run -v -r --focus="should write exit file$"
Running Suite: ConmonClient - /root/conmon-rs/pkg/client
========================================================
Random Seed: 1651567071

Will run 1 of 25 specs
SSSSSSSS
------------------------------
ConmonClient CreateContainer
  should write exit file
  /root/conmon-rs/pkg/client/client_test.go:60
2022-05-03T08:37:52.689Z INFO [conmon::server] Using stdout logger
2022-05-03T08:37:52.689Z INFO [conmon::server] Set log level to: DEBUG
2022-05-03T08:37:52.694Z DEBUG [conmon::rpc] Got a version request
2022-05-03T08:37:52.697Z DEBUG [conmon::rpc] Got a create container request for id 145ac9c38327b45b8eb09b9823e2b71e8c43812226ef8abf2ba7d0b463cefb95
2022-05-03T08:37:52.697Z DEBUG [conmon::streams] Creating new IO streams
2022-05-03T08:37:52.698Z DEBUG [conmon::rpc] PID file is /tmp/conmon-client954065182/pidfile
2022-05-03T08:37:52.698Z DEBUG [conmon::server] Runtime args "--root=/tmp/conmon-client954065182/root create --bundle /tmp/conmon-client954065182 --pid-file /tmp/conmon-client954065182/pidfile 145ac9c38327b45b8eb09b9823e2b71e8c43812226ef8abf2ba7d0b463cefb95"
2022-05-03T08:37:52.698Z DEBUG [conmon::cri_logger] Initializing CRI logger in path /tmp/conmon-client954065182/log
2022-05-03T08:37:52.699Z DEBUG [conmon::streams] Start reading from IO streams
2022-05-03T08:37:52.845Z DEBUG [conmon::oom_watcher] 7611: enabled cgroup v1 oom handling
2022-05-03T08:37:52.845Z DEBUG [conmon::oom_watcher] 7611: using cgroup path : /proc/7611/cgroup
2022-05-03T08:37:52.850Z DEBUG [conmon::oom_watcher] 7611: memory_cgroup_file_oom_path /sys/fs/cgroup/memory/user.slice/145ac9c38327b45b8eb09b9823e2b71e8c43812226ef8abf2ba7d0b463cefb95/memory.oom_control
2022-05-03T08:37:52.852Z DEBUG [conmon::oom_watcher] 7611: successfully setup cgroup v1 oom detection
2022-05-03T08:37:52.881Z DEBUG [conmon::container_io] fd:14:read 21 bytes
2022-05-03T08:37:52.910Z DEBUG [conmon::container_io] fd:16:No more to read
2022-05-03T08:37:52.911Z DEBUG [conmon::container_io] fd:14:No more to read
VmRSS for server is 10592
•SSSSSSSSSSSSSSSS

Ran 1 of 25 Specs in 0.255 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 24 Skipped
PASS
2022-05-03T08:37:52.940Z INFO [conmon::server] Received SIGINT
2022-05-03T08:37:52.940Z DEBUG [conmon::child_reaper] 7611: killing pid
2022-05-03T08:37:52.940Z DEBUG [conmon::child_reaper] Failed killing pid 7611 with error ESRCH: No such process
2022-05-03T08:37:52.940Z DEBUG [conmon::server] Removing socket file /tmp/conmon-client954065182/conmon.sock
2022-05-03T08:37:52.940Z DEBUG [conmon::oom_watcher] 7611: done watching for ooms

@rphillips
Copy link
Collaborator Author

This is ready now... phew.

@@ -7,7 +7,7 @@ on:
env:
CARGO_TERM_COLOR: always
GO_VERSION: '1.18'
ACTION_MSRV_TOOLCHAIN: 1.59.0
ACTION_MSRV_TOOLCHAIN: 1.60.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this toolchain available on rhcos?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cgwalters if you know

@@ -143,7 +143,7 @@ impl ChildReaper {

pub fn kill_grandchildren(&self, s: Signal) -> Result<()> {
for (_, grandchild) in self.grandchildren.lock().unwrap().iter() {
debug!("{}: killing grandchild", grandchild.pid);
debug!(grandchild.pid, "killing grandchild");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this work got mixed with the toolchain bump, can we separate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even prefer we just merge this stuff with the original commit, as the git tree will look as though span support was added before this anyway

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed that up.

@haircommander
Copy link
Collaborator

one small nit, otherwise LGTM

what about 1.60 helped us?

great work on this, it's been a long haul!

Signed-off-by: Ryan Phillips <[email protected]>
Signed-off-by: Ryan Phillips <[email protected]>
Signed-off-by: Ryan Phillips <[email protected]>
@rphillips
Copy link
Collaborator Author

@haircommander We hit this problem with the PR: rust-lang/cargo#10623. The only thing I could find to do is bump the version.

@haircommander
Copy link
Collaborator

LGTM, summoning @saschagrunert for the merge

Copy link
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just see some unwraps and can follow-up on them. Great work @rphillips!

@saschagrunert saschagrunert merged commit 8868738 into containers:main May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catch container OOM
4 participants