Revamp Testing Infrastructure and run Multi-Kernel Tests in CI (#111)

* Fix invalid enum relocation * Improve error logging on JSON unmarshal failure EventsTrace was going into a print "\n" loop on a misconfigured kernel, which was hard to diagnose with the current error logging. * Fix broken probe_set_features logic See comment, libbpf_probe_bpf_prog_type will return true on both aarch64 and x86_64 as it checks for the ability to _load_, not the ability to _attach_, which is what will fail on aarch64. * Rework multi-kernel-tester and run in CI - Move the core initramfs and init process logic to Bluebox - Set up the whole thing to run in CI with mainline kernels v5.11-v5.18 - Redo testing/README.md for new changes * Add debug build target * Clarify format / test-format targets, format code * Disable fail-fast in multikernel test action * Clarify and cleanup formatting/build CI workflow * Dump contents of tracefs trace file on test fail This file contains the result of all bpf_prink's, which often encompasses a wealth of information when probes fail to do something. * Fix typo in README.md Co-authored-by: Nicholas Berlin <[email protected]> * Fix test failures on Linux 5.11/aarch64 We were attempting to dereference a struct dentry into a struct dentry *, which resulted in garbage data. Add FUNC_ARG_READ_PTREGS_NODEREF to skip the dereference step and use it to fix the test failures. * Fix broken build-debug target in Makefile Additionally, remove nonexistent .PHONY targets and simplify build-debug by removing _internal-build-debug target. * Remove -DENABLE_BPF_PRINTK from CI builds Enabling bpf_printk has a huge performance penalty in the emulated VMs running in CI. Running aarch64 tests on Linux 5.11 without bpf_printk takes ~1m30s while the same run without bpf_printk takes ~2m10s. While this is larger than I'd expect, I can see it happening on an emulated VM, every bpf_printk call corresponds to a lot of extra instructions that have to be emulated. Disable the flag. This reduces the debug info present in CI artifacts, but if tests fail, devs should be able to run the failing test locally with -DENABLE_BPF_PRINTK to get debug output. * Improve logging on test failure - Log stacktrace to stdout so logs are all serialized - Log banner warning that the trace file is always empty on CI - I can see this eating a day of a dev's time otherwise :P * Fix BPF tramp detection, add multi-kernel tests This logic should properly detect a lack of BPF trampoline support on x86 kernels that don't have it enabled (unlike the previous simple arch check). It's also incredibly subject to change in a new kernel release (what if taskstats_exit dissapears?), so add a multi-kernel test for feature probing. This should be nicely extensible to other features in the future. * Document unintuitive API * Add comment to TestFeaturesCorrect RE: x86 * Dockerize code formatting Makefile target Differences in clang-format versions were causing a headache here. clang-format 13 was fine with the code as-is, but clang-format 11 would make this change: diff --git a/non-GPL/TcLoader/TcLoader.c b/non-GPL/TcLoader/TcLoader.c index 9a8bc0c..1ed7c1d 100644 --- a/non-GPL/TcLoader/TcLoader.c +++ b/non-GPL/TcLoader/TcLoader.c @@ -371,10 +371,10 @@ static int netlink_qdisc(int cmd, unsigned int flags, const char *ifname) int rv = -1; struct rtnetlink_handle qdisc_rth = {.fd = -1}; struct netlink_msg qdisc_req = { - .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .n.nlmsg_flags = NLM_F_REQUEST | flags, - .n.nlmsg_type = cmd, - .t.tcm_family = AF_UNSPEC, + .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .n.nlmsg_flags = NLM_F_REQUEST | flags, + .n.nlmsg_type = cmd, + .t.tcm_family = AF_UNSPEC, }; if (!ifname) { This seems to be a result of the AlignConsecutiveAssignments directive in .clang-format (removing it makes the two versions agree). Dockerize the format step to cleanly solve the problem for all such cases in the future. * Fix incorrect docker tag s/_TAG/_PULL_TAG/g Should be using the remote instead of the local. * Fix clang-format not running in a container find was running in the container, clang-format was not due to how the command was being interpreted aargh. * Fix incorrect script args in comment Co-Authored-By: Florian Lehner <[email protected]> * Remove time estimate in README Co-Authored-By: Florian Lehner <[email protected]> * Cleanup run_tests.sh arguments -d really should have allowed users to pass individual images if they want something more specific tested. Also remove old information from usage text and update for new -k arg. * Add note about debootstrap to builder script Co-Authored-By: Florian Lehner <[email protected]> * Fix improper getopts usage An option cannot have > 1 argument. Just put the list kernel images at the end (no specific flag required). * Fix incorrect variable name * Fix faulty bash list logic * Change BPFTOOL_VERSION to LINUX_TOOLS_VERSION Co-Authored-By: Florian Lehner <[email protected]> * Add gen_initramfs.sh script, update debug docs * Greatly clean up bash scripts - Use readonly/local - Make conditionals more concise with || and && - Pull everything into functions, no naked code - Standardize on lowercase for local vars - Make command line arg format consistent - Rename run_single_test.sh to invoke_qemu.sh, pull out all test logic for better separation of concerns - Update README.md to suggest usage of invoke_qemu.sh instead of directly invoking QEMU - Add ~/go/bin to PATH in bash instead of workflow YAML - Greatly clean up GNU parallel invocation * Fix caching logic in GH actions workflow Co-Authored-By: Lovel Rishi <[email protected]> * Fix inspecific restore key We could have restored the set of aarch64 kernels in the x86 tests with the workflow as it was. * Remove reduntant apt-get update Co-Authored-By: Lovel Rishi <[email protected]> * Cleanup Makefile - Add tag-container target and cleanup / separate docker related vars - Remove DOCKER_IMG_UBUNUT_VERSION * Fix broken test-format target * Add section to README.md on userspace debugging Co-Authored-By: Florian Lehner <[email protected]> * Add missing arg to gen_initramfs.sh help * Remove KVM section from README.md We're not currently using it in CI, and all dev work on my end has been done without it. We can discuss enabling it if/when we start using a custom runner, but for now, remove the clutter. * Fixups to scripts/invoke_qemu.sh - Make KVM probing an option, turned off by default (it seems to break debugging on my machine, and provides no real benefit anyways nested in a VBox VM) - Only provide one -append flag to QEMU, it will ignore all but the last, causing the -d option to not work * Add missing -o to find invocation in Makefile Co-authored-by: Nicholas Berlin <[email protected]> Co-authored-by: Florian Lehner <[email protected]> Co-authored-by: Lovel Rishi <[email protected]>
elastic · Jul 21, 2022 · e9e5d74 · e9e5d74
1 parent a7e8fab
commit e9e5d74
Show file tree

Hide file tree

Showing 26 changed files with 1,219 additions and 713 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
diff --git a/.github/workflows/formatting-build.yml b/.github/workflows/formatting-build.yml
@@ -0,0 +1,47 @@
+name: Formatting and Build
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  formatting-and-build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Test code formatting
+        run: make test-format
+      - name: Cache hash
+        run: echo "GITHUB_CACHE_HASH=$(md5sum Makefile docker/Dockerfile.builder | md5sum | awk '{print $1}')" >> $GITHUB_ENV
+      - name: Setup docker cache
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/docker
+          key: ${{ env.GITHUB_CACHE_HASH }}-v0
+          restore-keys: |
+            ${{ env.GITHUB_CACHE_HASH }}-v0
+      - name: Configure Docker
+        run: |
+          sudo systemctl stop docker
+          sudo rm -rf /var/lib/docker
+          sudo mkdir -p /var/lib/docker
+          sudo mkdir -p ~/.cache/docker
+          sudo chown -fR root:root ~/.cache/docker
+          sudo mount --rbind ~/.cache/docker /var/lib/docker
+          sudo systemctl start docker
+      - name: Test build
+        run: make build
+      - name: Test for source differences post-build
+        run: git diff --exit-code
+      - name: Test container image build
+        run: make container
+      - name: Fix permissions (last step)
+        run: |
+          docker system prune -f
+          sudo systemctl stop docker
+          sudo umount /var/lib/docker
+          sudo chown -R $USER:$USER ~/.cache/docker
+          sudo rm -rf ~/.cache/docker/volumes/backingFsBlockDev
+          sudo find ~/.cache/docker -name work | sudo xargs chmod -R 700
diff --git a/.github/workflows/multikernel-tester.yml b/.github/workflows/multikernel-tester.yml
@@ -0,0 +1,75 @@
+name: Multi Kernel Testing
+on: [ pull_request ]
+
+jobs:
+  multikernel_tester:
+    strategy:
+      # If a failure occurs, run the other arches/distros to the end. It's useful to see if it
+      # occurs on other kernels as well
+      fail-fast: false
+      matrix:
+        kernel_flavor: [ mainline ]
+        arch: [ x86_64, aarch64 ]
+    runs-on: ubuntu-latest
+    steps:
+    - name: Checkout repository
+      uses: actions/checkout@v2
+    - name: Build eBPF probes and userspace components
+      run: make build ARCH=${{ matrix.arch }}
+    - name: Auth GCP
+      uses: 'google-github-actions/auth@v0'
+      with:
+        credentials_json: '${{ secrets.ACTIONS_GCP_JSON_CREDENTIALS }}'
+    - name: 'Setup gcloud'
+      uses: 'google-github-actions/setup-gcloud@v0'
+    - name: Create kernel images directory
+      run: |
+        sudo mkdir -p /kernel-images
+        sudo chown -fR $USER:$USER /kernel-images
+    - name: Get current timestamp for GCS cache key
+      run: echo "TIMESTAMP=$(date +%s)" >> $GITHUB_ENV
+    - name: Setup GCS cache
+      id: cache
+      uses: actions/cache@v3
+      with:
+        path: /kernel-images
+        key: gcs-cache-${{ matrix.kernel_flavor }}-${{ matrix.arch }}-${{ env.TIMESTAMP }}
+        restore-keys: gcs-cache-${{ matrix.kernel_flavor }}-${{ matrix.arch }}
+    - name: Rsync kernel images from GCS
+      run: gsutil -m rsync -d -r gs://ebpf-ci-kernel-images/${{ matrix.kernel_flavor }}/${{ matrix.arch }}/images /kernel-images/
+    - name: Install packages needed for testing
+      run: |
+        sudo apt-get update
+        sudo apt-get install -y --no-install-recommends \
+        gcc-aarch64-linux-gnu \
+        libc6-dev-arm64-cross \
+        parallel \
+        qemu-system-x86 \
+        qemu-system-arm
+      env:
+        DEBIAN_FRONTEND: noninteractive
+    - name: Install Go
+      uses: actions/setup-go@v3
+      with:
+        go-version: '1.17'
+    - name: Install Bluebox
+      run: go install github.com/florianl/bluebox@b8590fb1850f56df6e6d7786931fcabdc1e9173d
+    - name: Run tests
+      working-directory: testing
+      run: |
+        ./run_tests.sh \
+        ${{ matrix.arch }} \
+        ../artifacts-${{ matrix.arch }}/non-GPL/EventsTrace/EventsTrace \
+        /kernel-images/*
+    - name: Archive test summary
+      if: always()
+      uses: actions/upload-artifact@v3
+      with:
+        name: run-summary-${{ matrix.kernel_flavor }}-${{ matrix.arch }}.txt
+        path: testing/bpf-check-summary.txt
+    - name: Archive individual test results
+      if: always()
+      uses: actions/upload-artifact@v3
+      with:
+        name: results-${{ matrix.kernel_flavor }}-${{ matrix.arch }}
+        path: testing/results
diff --git a/GPL/EventProbe/CMakeLists.txt b/GPL/EventProbe/CMakeLists.txt
@@ -7,7 +7,7 @@
 
 # BPF program
 
-if (CMAKE_BUILD_TYPE STREQUAL Debug)
+if (CMAKE_BUILD_TYPE STREQUAL Debug OR ENABLE_BPF_PRINTK)
     set(BPF_DEBUG_TRACE 1)
 else()
     set(BPF_DEBUG_TRACE 0)

diff --git a/GPL/EventProbe/File/Probe.bpf.c b/GPL/EventProbe/File/Probe.bpf.c
@@ -329,12 +329,12 @@ int BPF_KPROBE(kprobe__vfs_rename)
         new_dentry = rd.new_dentry;
     } else {
         /* Dentries are accessible from ctx */
-        err = FUNC_ARG_READ_PTREGS(old_dentry, vfs_rename, old_dentry);
+        err = FUNC_ARG_READ_PTREGS_NODEREF(old_dentry, vfs_rename, old_dentry);
         if (err) {
             bpf_printk("kprobe__vfs_rename: error reading old_dentry\n");
             return 0;
         }
-        err = FUNC_ARG_READ_PTREGS(new_dentry, vfs_rename, new_dentry);
+        err = FUNC_ARG_READ_PTREGS_NODEREF(new_dentry, vfs_rename, new_dentry);
         if (err) {
             bpf_printk("kprobe__vfs_rename: error reading new_dentry\n");
             return 0;

diff --git a/GPL/EventProbe/Helpers.h b/GPL/EventProbe/Helpers.h
@@ -28,6 +28,56 @@
         _ret;                                                                                      \
     })
 
+/*
+ *  Reads the specified argument from struct pt_regs without dereferencing it
+ *  (unlike FUNC_ARG_READ_PTREGS) (i.e. we get a  pointer to the argument, not
+ *  the argument itself). Note that we first have to read the value in struct
+ *  pt_regs into a volatile temporary (_dst). Without this, LLVM can generate
+ *  code like the following, which will fail to verify:
+ *
+ *  r3 = 8                      # The register value we want to read is at offset 8 in the context
+ *  r2 = r1                     # r1 = ctx pointer
+ *  r2 += r3                    # Increment ctx ptr to register value we're interested in
+ *  r3 = *(u64 *)(r2 +0)        # Dereference it (fail)
+ *  dereference of modified ctx ptr R2 off=8 disallowed
+ *
+ *  The verifier disallows dereferencing the context pointer when it's been
+ *  modified. This will often happen as an inlining optimization if dst is
+ *  immediately passed into a function. We instead want code like the following
+ *  to be generated:
+ *
+ *  r2 = r1                     # r1 = ctx pointer
+ *  r3 = *(u64 *)(r2 + 8)       # Dereference it, putting the increment in the dereference insn
+ *  ...pass r3 to a function
+ */
+#define FUNC_ARG_READ_PTREGS_NODEREF(dst, func, arg)                                               \
+    ({                                                                                             \
+        int ret = 0;                                                                               \
+        volatile typeof(dst) _dst;                                                                 \
+        switch (arg__##func##__##arg##__) {                                                        \
+        case 0:                                                                                    \
+            _dst = (typeof(dst))PT_REGS_PARM1(ctx);                                                \
+            break;                                                                                 \
+        case 1:                                                                                    \
+            _dst = (typeof(dst))PT_REGS_PARM2(ctx);                                                \
+            break;                                                                                 \
+        case 2:                                                                                    \
+            _dst = (typeof(dst))PT_REGS_PARM3(ctx);                                                \
+            break;                                                                                 \
+        case 3:                                                                                    \
+            _dst = (typeof(dst))PT_REGS_PARM4(ctx);                                                \
+            break;                                                                                 \
+        case 4:                                                                                    \
+            _dst = (typeof(dst))PT_REGS_PARM5(ctx);                                                \
+            break;                                                                                 \
+        default:                                                                                   \
+            ret = -1;                                                                              \
+        };                                                                                         \
+        dst = _dst;                                                                                \
+        barrier();                                                                                 \
+        ret;                                                                                       \
+    })
+
 #define FUNC_ARG_READ_PTREGS(dst, func, arg)                                                       \
     ({                                                                                             \
         int ret = 0;                                                                               \

diff --git a/GPL/EventProbe/PathResolver.h b/GPL/EventProbe/PathResolver.h
@@ -238,7 +238,21 @@ static void ebpf_resolve_kernfs_node_to_string(char *buf, struct kernfs_node *kn
 
 static void ebpf_resolve_pids_ss_cgroup_path_to_string(char *buf, const struct task_struct *task)
 {
-    struct kernfs_node *kn = BPF_CORE_READ(task, cgroups, subsys[pids_cgrp_id], cgroup, kn);
+    /*
+     * Since pids_cgrp_id is an enum value, we need to get it at runtime as it
+     * can change kernel-to-kernel depending on the kconfig or possibly not be
+     * enabled at all.
+     */
+    int cgrp_id;
+    if (bpf_core_enum_value_exists(enum cgroup_subsys_id, pids_cgrp_id)) {
+        cgrp_id = bpf_core_enum_value(enum cgroup_subsys_id, pids_cgrp_id);
+    } else {
+        /* Pids cgroup is not enabled on this kernel */
+        buf[0] = '\0';
+        return;
+    }
+
+    struct kernfs_node *kn = BPF_CORE_READ(task, cgroups, subsys[cgrp_id], cgroup, kn);
     ebpf_resolve_kernfs_node_to_string(buf, kn);
 }
 

diff --git a/Makefile b/Makefile
@@ -1,39 +1,79 @@
 ARCH ?= $(shell arch)
-BUILD_DIR ?= artifacts-${ARCH}
-SUDO ?= 
-PWD ?= $(shell pwd)
-CONTAINER_RUNTIME ?= docker
-DOCKER_IMG_UBUNTU_VERSION ?= jammy
-BUILDER_PULL_TAG ?= us-docker.pkg.dev/elastic-security-dev/ebpf-public/builder:20220620-0715
-BUILDER_TAG ?= us-docker.pkg.dev/elastic-security-dev/ebpf-public/builder:${USER}-latest
-C_INCLUDE_PATH ?=
-DOCKER_CACHE ?=
 
+DOCKER_IMAGE = us-docker.pkg.dev/elastic-security-dev/ebpf-public/builder
+DOCKER_PULL_TAG = 20220711-1742
+DOCKER_LOCAL_TAG = ${USER}-latest
+CURRENT_DATE_TAG = $(shell date +%Y%m%d-%H%M)
 
-.PHONY = build build-local clean container fix-permissions format test-format
+PWD = $(shell pwd)
+BUILD_DIR = artifacts-${ARCH}
+CMAKE_FLAGS = -DARCH=${ARCH} -DBUILD_STATIC_EVENTSTRACE=True -DUSE_BUILTIN_VMLINUX=True -B${BUILD_DIR} -S${PWD}
 
-container:
-	${CONTAINER_RUNTIME} build ${DOCKER_CACHE} -t ${BUILDER_TAG} --build-arg PULL_TAG=${DOCKER_IMG_UBUNTU_VERSION} -f docker/Dockerfile.builder .
+# Directories to search recursively for c/cpp source files to clang-format
+FORMAT_DIRS = GPL/ non-GPL/ testing/test_bins
+
+.PHONY = build build-debug _internal-build clean container format test-format
+
+# Kludge to get around a missing header. If we don't do this, we'll get the following error when
+# building:
+#
+# In file included from /home/vagrant/ebpf/contrib/libbpf/include/uapi/linux/bpf.h:11:
+# In file included from /home/vagrant/ebpf/contrib/libbpf/include/linux/types.h:8:
+# In file included from /usr/lib/llvm-14/lib/clang/14.0.0/include/stdint.h:52:
+# /usr/include/stdint.h:26:10: fatal error: 'bits/libc-header-start.h' file not found
+# include <bits/libc-header-start.h>
+#        ^~~~~~~~~~~~~~~~~~~~~~~~~~
+#
+# The HostIsolation probes include linux/bpf.h (copied into the libbpf repo) which includes
+# linux/types.h (also copied into the libbpf repo) which includes stdint.h. The clang stdint.h
+# includes bits/libc-header-start.h which is not in our include path. The correct one to use
+# depends on which arch we're compiling for.
+ifeq ($(ARCH),x86_64)
+	export C_INCLUDE_PATH = /usr/include/x86_64-linux-gnu
+else
+	export C_INCLUDE_PATH = /usr/aarch64-linux-gnu/include
+endif
 
-build-local:
-	mkdir -p ${BUILD_DIR}
-	C_INCLUDE_PATH=${C_INCLUDE_PATH} cmake -DUSE_BUILTIN_VMLINUX=True -B${BUILD_DIR} -S${PWD}
-	C_INCLUDE_PATH=${C_INCLUDE_PATH} make -C${BUILD_DIR}
+export CC=${ARCH}-linux-gnu-gcc
+export CXX=${ARCH}-linux-gnu-g++
+export AR=${ARCH}-linux-gnu-ar
+export LD=${ARCH}-linux-gnu-ld
 
 build:
-	docker run --rm -v${PWD}:${PWD} -w${PWD} ${BUILDER_PULL_TAG} 
+	docker run --rm -v${PWD}:${PWD} -w${PWD} ${DOCKER_IMAGE}:${DOCKER_PULL_TAG} \
+		/usr/bin/env make _internal-build ARCH=${ARCH} EXTRA_CMAKE_FLAGS=${EXTRA_CMAKE_FLAGS}
+	sudo chown -fR ${USER}:${USER} ${BUILD_DIR}
 	@echo "\n++ Build Successful at `date` ++\n"
 
-fix-permissions:
+# Convenience target to pass -DCMAKE_BUILD_TYPE=Debug and -DCMAKE_C_FLAGS="-g -O0"
+build-debug:
+	docker run --rm -v${PWD}:${PWD} -w${PWD} ${DOCKER_IMAGE}:${DOCKER_PULL_TAG} \
+		/usr/bin/env make _internal-build ARCH=${ARCH} EXTRA_CMAKE_FLAGS='-DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_FLAGS="-g -O0"'
 	sudo chown -fR ${USER}:${USER} ${BUILD_DIR}
+	@echo "\n++ Build Successful at `date` ++\n"
+
+_internal-build:
+	mkdir -p ${BUILD_DIR}/
+	cmake ${EXTRA_CMAKE_FLAGS} ${CMAKE_FLAGS}
+	make -C${BUILD_DIR} -j$(shell nproc)
+
+container:
+	docker build -t ${DOCKER_LOCAL_TAG} -f docker/Dockerfile.builder .
+
+tag-container:
+	docker tag ${DOCKER_IMAGE}:${DOCKER_LOCAL_TAG} ${DOCKER_IMAGE}:$CURRENT_DATE_TAG
+	@echo "\n++ Tagged image as ${DOCKER_IMAGE}:${CURRENT_DATE_TAG} ++\n"
 
+# We dockerize code formatting because differences in clang-format versions can
+# lead to different formatting decisions. This way, everyone is using
+# clang-format 14 (default in the Ubuntu jammy repos).
 format:
-	find . \( -path ./contrib -o -path ./artifacts* \) -prune \
-	-o -name "*.c" -o -name "*.cpp" -o -name "*.h" -print | xargs /usr/bin/env clang-format -i
+	docker run --rm -v${PWD}:${PWD} -w${PWD} ${DOCKER_IMAGE}:${DOCKER_PULL_TAG} \
+		sh -c 'find ${FORMAT_DIRS} -name "*.cpp" -o -name "*.c" -o -name "*.h" -o -name "*.cpp" | xargs /usr/bin/env clang-format -i'
 
 test-format:
-	find . \( -path ./contrib -o -path ./artifacts* \) -prune \
-	-o -name "*.c" -o -name "*.cpp" -o -name "*.h" -print | xargs /usr/bin/env clang-format --dry-run -Werror
+	docker run --rm -v${PWD}:${PWD} -w${PWD} ${DOCKER_IMAGE}:${DOCKER_PULL_TAG} \
+		sh -c 'find ${FORMAT_DIRS} -name "*.cpp" -o -name "*.c" -o -name "*.h" -o -name "*.cpp" | xargs /usr/bin/env clang-format -i --dry-run -Werror'
 
 clean:
-	${SUDO} rm -rf artifacts-*
+	sudo rm -rf artifacts-*