-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revamp Testing Infrastructure and run Multi-Kernel Tests in CI #111
Conversation
EventsTrace was going into a print "\n" loop on a misconfigured kernel, which was hard to diagnose with the current error logging.
See comment, libbpf_probe_bpf_prog_type will return true on both aarch64 and x86_64 as it checks for the ability to _load_, not the ability to _attach_, which is what will fail on aarch64.
- Move the core initramfs and init process logic to Bluebox - Set up the whole thing to run in CI with mainline kernels v5.11-v5.18 - Redo testing/README.md for new changes
This file contains the result of all bpf_prink's, which often encompasses a wealth of information when probes fail to do something.
b53a9a4
to
dbbfe32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. A couple of comments.
Co-authored-by: Nicholas Berlin <[email protected]>
We were attempting to dereference a struct dentry into a struct dentry *, which resulted in garbage data. Add FUNC_ARG_READ_PTREGS_NODEREF to skip the dereference step and use it to fix the test failures.
Additionally, remove nonexistent .PHONY targets and simplify build-debug by removing _internal-build-debug target.
Enabling bpf_printk has a huge performance penalty in the emulated VMs running in CI. Running aarch64 tests on Linux 5.11 without bpf_printk takes ~1m30s while the same run without bpf_printk takes ~2m10s. While this is larger than I'd expect, I can see it happening on an emulated VM, every bpf_printk call corresponds to a lot of extra instructions that have to be emulated. Disable the flag. This reduces the debug info present in CI artifacts, but if tests fail, devs should be able to run the failing test locally with -DENABLE_BPF_PRINTK to get debug output.
- Log stacktrace to stdout so logs are all serialized - Log banner warning that the trace file is always empty on CI - I can see this eating a day of a dev's time otherwise :P
This logic should properly detect a lack of BPF trampoline support on x86 kernels that don't have it enabled (unlike the previous simple arch check). It's also incredibly subject to change in a new kernel release (what if taskstats_exit dissapears?), so add a multi-kernel test for feature probing. This should be nicely extensible to other features in the future.
Differences in clang-format versions were causing a headache here. clang-format 13 was fine with the code as-is, but clang-format 11 would make this change: diff --git a/non-GPL/TcLoader/TcLoader.c b/non-GPL/TcLoader/TcLoader.c index 9a8bc0c..1ed7c1d 100644 --- a/non-GPL/TcLoader/TcLoader.c +++ b/non-GPL/TcLoader/TcLoader.c @@ -371,10 +371,10 @@ static int netlink_qdisc(int cmd, unsigned int flags, const char *ifname) int rv = -1; struct rtnetlink_handle qdisc_rth = {.fd = -1}; struct netlink_msg qdisc_req = { - .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .n.nlmsg_flags = NLM_F_REQUEST | flags, - .n.nlmsg_type = cmd, - .t.tcm_family = AF_UNSPEC, + .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .n.nlmsg_flags = NLM_F_REQUEST | flags, + .n.nlmsg_type = cmd, + .t.tcm_family = AF_UNSPEC, }; if (!ifname) { This seems to be a result of the AlignConsecutiveAssignments directive in .clang-format (removing it makes the two versions agree). Dockerize the format step to cleanly solve the problem for all such cases in the future.
An option cannot have > 1 argument. Just put the list kernel images at the end (no specific flag required).
Co-Authored-By: Florian Lehner <[email protected]>
- Use readonly/local - Make conditionals more concise with || and && - Pull everything into functions, no naked code - Standardize on lowercase for local vars - Make command line arg format consistent - Rename run_single_test.sh to invoke_qemu.sh, pull out all test logic for better separation of concerns - Update README.md to suggest usage of invoke_qemu.sh instead of directly invoking QEMU - Add ~/go/bin to PATH in bash instead of workflow YAML - Greatly clean up GNU parallel invocation
Co-Authored-By: Lovel Rishi <[email protected]>
We could have restored the set of aarch64 kernels in the x86 tests with the workflow as it was.
Co-Authored-By: Lovel Rishi <[email protected]>
- Add tag-container target and cleanup / separate docker related vars - Remove DOCKER_IMG_UBUNUT_VERSION
Co-Authored-By: Florian Lehner <[email protected]>
We're not currently using it in CI, and all dev work on my end has been done without it. We can discuss enabling it if/when we start using a custom runner, but for now, remove the clutter.
- Make KVM probing an option, turned off by default (it seems to break debugging on my machine, and provides no real benefit anyways nested in a VBox VM) - Only provide one -append flag to QEMU, it will ignore all but the last, causing the -d option to not work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍🏽 Great work Rhys!! Thanks for all the collab Florian!
* Fix invalid enum relocation * Improve error logging on JSON unmarshal failure EventsTrace was going into a print "\n" loop on a misconfigured kernel, which was hard to diagnose with the current error logging. * Fix broken probe_set_features logic See comment, libbpf_probe_bpf_prog_type will return true on both aarch64 and x86_64 as it checks for the ability to _load_, not the ability to _attach_, which is what will fail on aarch64. * Rework multi-kernel-tester and run in CI - Move the core initramfs and init process logic to Bluebox - Set up the whole thing to run in CI with mainline kernels v5.11-v5.18 - Redo testing/README.md for new changes * Add debug build target * Clarify format / test-format targets, format code * Disable fail-fast in multikernel test action * Clarify and cleanup formatting/build CI workflow * Dump contents of tracefs trace file on test fail This file contains the result of all bpf_prink's, which often encompasses a wealth of information when probes fail to do something. * Fix typo in README.md Co-authored-by: Nicholas Berlin <[email protected]> * Fix test failures on Linux 5.11/aarch64 We were attempting to dereference a struct dentry into a struct dentry *, which resulted in garbage data. Add FUNC_ARG_READ_PTREGS_NODEREF to skip the dereference step and use it to fix the test failures. * Fix broken build-debug target in Makefile Additionally, remove nonexistent .PHONY targets and simplify build-debug by removing _internal-build-debug target. * Remove -DENABLE_BPF_PRINTK from CI builds Enabling bpf_printk has a huge performance penalty in the emulated VMs running in CI. Running aarch64 tests on Linux 5.11 without bpf_printk takes ~1m30s while the same run without bpf_printk takes ~2m10s. While this is larger than I'd expect, I can see it happening on an emulated VM, every bpf_printk call corresponds to a lot of extra instructions that have to be emulated. Disable the flag. This reduces the debug info present in CI artifacts, but if tests fail, devs should be able to run the failing test locally with -DENABLE_BPF_PRINTK to get debug output. * Improve logging on test failure - Log stacktrace to stdout so logs are all serialized - Log banner warning that the trace file is always empty on CI - I can see this eating a day of a dev's time otherwise :P * Fix BPF tramp detection, add multi-kernel tests This logic should properly detect a lack of BPF trampoline support on x86 kernels that don't have it enabled (unlike the previous simple arch check). It's also incredibly subject to change in a new kernel release (what if taskstats_exit dissapears?), so add a multi-kernel test for feature probing. This should be nicely extensible to other features in the future. * Document unintuitive API * Add comment to TestFeaturesCorrect RE: x86 * Dockerize code formatting Makefile target Differences in clang-format versions were causing a headache here. clang-format 13 was fine with the code as-is, but clang-format 11 would make this change: diff --git a/non-GPL/TcLoader/TcLoader.c b/non-GPL/TcLoader/TcLoader.c index 9a8bc0c..1ed7c1d 100644 --- a/non-GPL/TcLoader/TcLoader.c +++ b/non-GPL/TcLoader/TcLoader.c @@ -371,10 +371,10 @@ static int netlink_qdisc(int cmd, unsigned int flags, const char *ifname) int rv = -1; struct rtnetlink_handle qdisc_rth = {.fd = -1}; struct netlink_msg qdisc_req = { - .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .n.nlmsg_flags = NLM_F_REQUEST | flags, - .n.nlmsg_type = cmd, - .t.tcm_family = AF_UNSPEC, + .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .n.nlmsg_flags = NLM_F_REQUEST | flags, + .n.nlmsg_type = cmd, + .t.tcm_family = AF_UNSPEC, }; if (!ifname) { This seems to be a result of the AlignConsecutiveAssignments directive in .clang-format (removing it makes the two versions agree). Dockerize the format step to cleanly solve the problem for all such cases in the future. * Fix incorrect docker tag s/_TAG/_PULL_TAG/g Should be using the remote instead of the local. * Fix clang-format not running in a container find was running in the container, clang-format was not due to how the command was being interpreted aargh. * Fix incorrect script args in comment Co-Authored-By: Florian Lehner <[email protected]> * Remove time estimate in README Co-Authored-By: Florian Lehner <[email protected]> * Cleanup run_tests.sh arguments -d really should have allowed users to pass individual images if they want something more specific tested. Also remove old information from usage text and update for new -k arg. * Add note about debootstrap to builder script Co-Authored-By: Florian Lehner <[email protected]> * Fix improper getopts usage An option cannot have > 1 argument. Just put the list kernel images at the end (no specific flag required). * Fix incorrect variable name * Fix faulty bash list logic * Change BPFTOOL_VERSION to LINUX_TOOLS_VERSION Co-Authored-By: Florian Lehner <[email protected]> * Add gen_initramfs.sh script, update debug docs * Greatly clean up bash scripts - Use readonly/local - Make conditionals more concise with || and && - Pull everything into functions, no naked code - Standardize on lowercase for local vars - Make command line arg format consistent - Rename run_single_test.sh to invoke_qemu.sh, pull out all test logic for better separation of concerns - Update README.md to suggest usage of invoke_qemu.sh instead of directly invoking QEMU - Add ~/go/bin to PATH in bash instead of workflow YAML - Greatly clean up GNU parallel invocation * Fix caching logic in GH actions workflow Co-Authored-By: Lovel Rishi <[email protected]> * Fix inspecific restore key We could have restored the set of aarch64 kernels in the x86 tests with the workflow as it was. * Remove reduntant apt-get update Co-Authored-By: Lovel Rishi <[email protected]> * Cleanup Makefile - Add tag-container target and cleanup / separate docker related vars - Remove DOCKER_IMG_UBUNUT_VERSION * Fix broken test-format target * Add section to README.md on userspace debugging Co-Authored-By: Florian Lehner <[email protected]> * Add missing arg to gen_initramfs.sh help * Remove KVM section from README.md We're not currently using it in CI, and all dev work on my end has been done without it. We can discuss enabling it if/when we start using a custom runner, but for now, remove the clutter. * Fixups to scripts/invoke_qemu.sh - Make KVM probing an option, turned off by default (it seems to break debugging on my machine, and provides no real benefit anyways nested in a VBox VM) - Only provide one -append flag to QEMU, it will ignore all but the last, causing the -d option to not work * Add missing -o to find invocation in Makefile Co-authored-by: Nicholas Berlin <[email protected]> Co-authored-by: Florian Lehner <[email protected]> Co-authored-by: Lovel Rishi <[email protected]>
This PR revamps the multi-kernel test setup in line with concerns recently
brought up by obs-profiling, extends it to support arm64 tests, and sets the
whole thing up to run in CI via GitHub actions.
Background
Concerns were brought up by obs-profiling about the initial iteration of the
testing framework locking them into something that wasn't adequate with regards
to cross platform compatibility.
Specifically, obs-profiling wanted the core of the framework to be very easy to
run on different architectures generally (be it generating a test initramfs for
x86 on an arm64 Mac or running a test in an arm64 guest on an x86 host).
Practically speaking, this means writing the core components of the framework
in Go, as it's super easy to cross compile for different architectures and
platforms.
While these concerns were less of a problem for AWP, obs-profiling still wanted
to be able to share common logic pertaining to running BPF tests with us as to
follow the DRY principle. The concern was that either obs-profiling would have
to create something else from scratch, thus performing redundant work, or use a
framework that wasn't quite what they were looking for.
Upon discussion with @florianl, the common logic that both teams will need is
really just limited to initramfs generation and test binary running (i.e. the
init process in the VM). Anything beyond that (e.g. specifics of the tests,
driver scripts to invoke QEMU) becomes pretty team specific.
As a result, we decided that both teams can just
go install
hisBluebox project to provide this core
functionality of initramfs generation and a VM init process. Bluebox is written
in Go and easily compiles to run on arm64/x86 hosts or generate initramfs's for
arm64/x86 guests. Anything beyond what Bluebox provides is team-specific and
can be implemented by the team writing the tests.
This will maximise the amount of shared code between our testing setups (which
is good from a DRY standpoint) while also allowing each team to build what
works best for them on top of Bluebox.
This PR provides the AWP-side realiziation of that work, and also sets up the
whole thing to run in CI via GitHub actions.
Changes Made
All initramfs generation logic has been moved into Bluebox. Thus the
init/
directory and parts of
run_tests.sh
that were responsible for invokingcpio
to generate an initramfs have been removed. That job is now done by Bluebox.
The
make container
make target has been revamped to build a container thatcan cross-compile the repo for x86_64 and arm64. See
docker/Dockerfile.builder
. Building the repo for different architectures isnow as simple as
make ARCH=x86_64
ormake ARCH=aarch64
. Debug builds areanalgous, the target name is just
build-debug
.The
build_mainline_kernels.sh
script has been updated to build all arm64 andx86_64 kernel images we need, and output kernel binaries with debug info, so
kernel code can be stepped through with QEMU in case of a gnarly BPF error (see
the debugging section in the updated
testing/README.md
). These binaries arestored in the GCS bucket under the
debug_binaries
directory of eacharchitecture directory.
CI
As mentioned, a GitHub actions workflow has been added to run tests on multiple
kernels. It pulls kernel images from a GCS bucket and runs tests against them.
For this initial iteration, only mainline kernels (built with the
build_mainline_kernels.sh
script) are run. Distro-specific kernels can beeasily added to the test matrix later via an extra line in the yaml.
Kernel images are cached via the cache action and only re-downloaded from GCS
if the containing directory changes to save bandwith and GCS costs.
It takes ~45s per kernel to run the kernel tests in a GitHub actions runner.
x86_64 and arm64 tests are run in parallel, and each encompass 8 kernels
currently (v5.11 - v5.18), so it takes ~6m to run them all. Unfortunately the
GitHub actions runners don't support KVM, which would drastically speed this
up. In the future, we can look at a self-hosted runner with KVM support if we
want to be really aggressive about testing on many, many kernels.
Whether tests pass or fail, a summary of the run and individual results are
archived in GitHub actions (see the summary page).
results-mainline-{arch}
contains the console output for the tests run on each kernel and
run-summary-{arch}.txt
contains a summary of the pass/fail status of eachtest on that particular architecture.
Bugs
A handful of bugs were discovered and fixed to get the tests running on all
kernels (see commits).
You'll note only one test is still failing --
(Thank you @mmat11 for helping me squash the last bug 🚀 )TestFileRename
on v5.11/aarch64. I'm opening this PR to get feedback nowwhile I continue to try to fix that, as fixing it has become a bit of a time
sink.
CC: @norrietaylor @SeanHeelan @m-sample