tests: add apparmor prompting integration tests #14518

olivercalder · 2024-09-18T02:40:42Z

Port the prompting-client integration tests to snapd and extend them to cover common use cases for AppArmor Prompting.

This work is tracked internally by https://warthogs.atlassian.net/browse/SNAPDENG-30450.

olivercalder · 2024-09-24T13:19:17Z

Interesting that jammy tests failed but noble succeeded, and that the prompting-client integration tests #14387 pass on jammy. This suggests a problem with the test setup, rather than prompting support itself. The error would seem to suggest this as well, but I'll need to run with -debug to check.

Run the prompting client in scripted mode in the background
+ prompting-client.scripted --script=/home/test/integration-tests/tmp.shU6Z7onh1/script.json --grace-period=1 --var=BASE_PATH:/home/test/integration-tests/tmp.shU6Z7onh1
creating client
script path: /home/test/integration-tests/tmp.shU6Z7onh1/script.json
Error: Io(Os { code: 13, kind: PermissionDenied, message: "Permission denied" })
+ echo Attempt to write the file
Attempt to write the file
+ snap run --shell prompting-client.scripted -c echo it is written > /home/test/integration-tests/tmp.shU6Z7onh1/test.txt
/bin/bash: line 1: /home/test/integration-tests/tmp.shU6Z7onh1/test.txt: Permission denied

Edit: Yes, appears to be a difference in the way access to non-owned files in $HOME is treated by the home interface between jammy and noble. Fixed in commit below.

olivercalder · 2024-09-24T21:16:41Z

I'd also like to add a test which queues up multiple requests, has the client reply to the final one, and then checks whether the earlier requests are handled by that reply. But this requires prompt entries in the script which do not have replies, and I believe that is not yet supported.

codecov · 2024-09-24T21:31:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.96%. Comparing base (96ea7b0) to head (f00813f).
Report is 15 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #14518      +/-   ##
==========================================
+ Coverage   78.95%   78.96%   +0.01%     
==========================================
  Files        1084     1084              
  Lines      146638   146709      +71     
==========================================
+ Hits       115773   115853      +80     
+ Misses      23667    23659       -8     
+ Partials     7198     7197       -1

Flag	Coverage Δ
unittests	`78.96% <ø> (+0.01%)`	⬆️

olivercalder · 2024-09-24T21:46:43Z

I think it's worth reviewing these as they are now, and either adding more cases here or opening follow-up PRs for more cases in the future.

Some remaining considerations for reviewers:

Would it be preferable to move the prompting-client.scripted call to task.yaml instead of invoking it in each .sh file?
- Sometimes the prompting-client call is expected to fail, so we would not want to universally check success in the result file... unless test cases which are expected to fail do manual checks of the result file and then mangle it to say success before returning
- I think in every case I can think of we'll want to at least invoke it at the start, and any test case setup can occur as root without interfering with/generating any prompts
Could test cases be moved to a loop within the test, rather than as distinct variants?
- Currently, the tests.session is created and restored for every variant, which takes a while
- If the tmpdir is created for each case, they should still stay isolated; that was the whole idea behind the top-level filter and using mktemp
Should test cases be autodetected to include any .sh file in the directory?

maykathm

Thanks! Overall, I think it looks great. To respond to your comments, I think it could be a little cleaner if, as you already mentioned, you moved the prompting-client.scripted call to the task level. In that same vein, since each test has a custom action that it performs followed by a custom check for success, you could place the "do action" and "check for success" logic for each test inside functions that get called at the task level. That would slightly reduce code duplication, since your wait conditions could also be moved to the task level. That said, I think the way you currently laid out the tests is just fine; I suppose I would favor the change more if you think many more tests will be added in the future.

maykathm · 2024-09-27T09:12:48Z

tests/main/apparmor-prompting-integration-tests/create_multiple_allow.sh

+	snap run --shell prompting-client.scripted -c "echo $name is written > ${TEST_DIR}/${name}"
+done
+
+sleep 5 # give the client a chance to write its result and exit


Nitpick: If possible, I would try to get rid of all sleeps throughout this PR. Sleeps introduce unnecessary delay and mask the true wait condition. What about something like timeout 5 bash -c 'while pgrep -f "prompting-client-scripted" > /dev/null; do true; done'?

I've replaced sleeps with timeouts like you suggested, and set the timeout duration as a variable in task.yaml which is passed into each test case script.

Thanks for the great suggestion about pgrep as well!
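
A variation on the wait suggested above, sketched under the assumption that the PID of the background client is known (for example, captured from `$!` when it is launched). `wait_for_pid_exit` is a hypothetical helper, not something taken from this PR. Polling on the PID sidesteps one pitfall of `pgrep -f`, which can match the polling shell's own command line when the pattern appears in it. The fractional `sleep` assumes GNU coreutils.

```shell
#!/bin/bash

# Hypothetical helper: wait until process $1 exits, giving up after $2 seconds.
# Returns 0 if the process exited in time, 1 if it was still running.
wait_for_pid_exit() {
    local pid="$1" limit="$2"
    local deadline=$(( $(date +%s) + limit ))
    # kill -0 sends no signal; it only checks that the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 0.1
    done
    return 0
}
```

A caller would start the client in the background, remember `$!`, and then call `wait_for_pid_exit "$client_pid" "$TIMEOUT"` in place of a fixed sleep, where `TIMEOUT` stands for whatever duration the task defines.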

olivercalder · 2024-10-09T19:05:24Z

Problem seems to be the kernel locking up waiting for a reply?? I'm testing in a lxd VM. After sending the first write request, sending another one does not cause a kernel request, and trying to open another shell fails.

I'll record a video to show what I'm seeing...

olivercalder · 2024-10-09T20:55:13Z

Here's the recording:

2024-10-09_14-11-21.mp4

Some annotations:

At 5:40: Yes, the write did succeed, and prompting-client.scripted will exit successfully, it's just first waiting for the 10 second grace period to elapse to make sure that no unexpected prompts occur.
At 6:30: We see responses sent back to the kernel for several previous requests. Those are the requests for test1.txt, test2.txt, test3.txt, plus two requests related to test4.txt (IDs 11 and 12): one for open/create (request ID 11) and one for actually writing the data to test4.txt (request ID 12). So snapd is doing what we expect it to be doing with respect to the outstanding prompts. But look: for all the other request IDs besides 11 and 12 (including the one for ID 6 which I incorrectly highlight), we see an error when responding: "error while responding to kernel: cannot perform IOCTL request APPARMOR_NOTIF_SEND: no such file or directory". But the response for ID 12 occurs after other "no such file or directory" failures, so it's not like the listener socket doesn't exist: that message must refer to the file which we're trying to allow a write to, and that file has since timed out and failed to be created. The question is where/why the hang is occurring in the shell, which prevents us from kicking off subsequent writes. Since it's the shell itself hanging, I suspect it's something to do with the kernel rather than snapd locking up (which was an initial theory of mine). But if it's the kernel hanging, then I'm not sure why it wouldn't hang when we send a single request and the prompting-client.scripted sends back a response. So it seems to me that the problem has something to do with a subprocess of the shell (or, in the case of the video, another shell spawned by tmux) being blocked on a syscall.
At 7:30: To clarify, they're not receiving a response (at least from snapd) until well after they time out, so they're being denied either by the kernel directly or by the application (snap run --shell firefox -c "..."). We did see the explicit apparmor="DENIED" ... name="test1.txt" earlier at 2:35, so it seems to me it's the kernel blocking this.

I'm going to test another theory to see if I can bypass the shell locking up and get the test to behave. Another (shorter) video incoming...

olivercalder · 2024-10-09T22:53:09Z

Okay, very interesting. Adding a sleep (to get the inner bash shells running first before anything hangs on the write) does not help, the same thing occurs. But we also saw an actual kernel trace this time:

Here's the recording, with some (overly-verbose, sorry) commentary about what's going on: https://bucket.calder.dev/integration-test-debugging/sleep-first.mp4

And here's the kernel trace:

Oct 09 22:10:45 integration kernel: INFO: task bash:19241 blocked for more than 122 seconds.
Oct 09 22:10:45 integration kernel:       Not tainted 6.8.0-45-generic #45-Ubuntu
Oct 09 22:10:45 integration kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 09 22:10:45 integration kernel: task:bash            state:D stack:0     pid:19241 tgid:19241 ppid:11904  flags:0x00000002
Oct 09 22:10:45 integration kernel: Call Trace:
Oct 09 22:10:45 integration kernel:  <TASK>
Oct 09 22:10:45 integration kernel:  schedule+0x33/0x110
Oct 09 22:10:45 integration kernel:  schedule_preempt_disabled+0x15/0x30
Oct 09 22:10:45 integration kernel:  rwsem_down_write_slowpath+0x27e/0x550
Oct 09 22:10:45 integration kernel:  ? step_into+0xfe/0x390
Oct 09 22:10:45 integration kernel:  down_write+0x5c/0x80
Oct 09 22:10:45 integration kernel:  open_last_lookups+0x137/0x400
Oct 09 22:10:45 integration kernel:  path_openat+0x99/0x2d0
Oct 09 22:10:45 integration kernel:  ? syscall_exit_to_user_mode+0x89/0x260
Oct 09 22:10:45 integration kernel:  do_sys_openat2+0xb3/0xe0
Oct 09 22:10:45 integration kernel:  __x64_sys_openat+0x55/0xa0
Oct 09 22:10:45 integration kernel:  x64_sys_call+0x1eb8/0x25c0
Oct 09 22:10:45 integration kernel:  do_syscall_64+0x7f/0x180
Oct 09 22:10:45 integration kernel:  ? irqentry_exit+0x43/0x50
Oct 09 22:10:45 integration kernel:  ? clear_bhb_loop+0x15/0x70
Oct 09 22:10:45 integration kernel:  ? clear_bhb_loop+0x15/0x70
Oct 09 22:10:45 integration kernel:  ? clear_bhb_loop+0x15/0x70
Oct 09 22:10:45 integration kernel: RIP: 0033:0x721bae97253b
Oct 09 22:10:45 integration kernel: RSP: 002b:00007ffec1a38490 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Oct 09 22:10:45 integration kernel: RAX: ffffffffffffffda RBX: 00005cfd2ac47260 RCX: 0000721bae97253b
Oct 09 22:10:45 integration kernel: RDX: 0000000000000241 RSI: 00005cfd2ac62c70 RDI: 00000000ffffff9c
Oct 09 22:10:45 integration kernel: RBP: 00005cfd2ac62c70 R08: 0000000000000000 R09: 0000000000000020
Oct 09 22:10:45 integration kernel: R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000241
Oct 09 22:10:45 integration kernel: R13: 0000000000000000 R14: 00005cfd2ac62c70 R15: 0000721bae85d300
Oct 09 22:10:45 integration kernel:  </TASK>
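
The `state:D` in the trace above marks the task as being in uninterruptible sleep. As a rough sketch (not something used in this PR), tasks currently in that state can be listed from /proc on Linux without waiting for the hung-task message; `list_blocked_tasks` is a hypothetical helper name.

```shell
#!/bin/bash

# Hypothetical helper: print "PID NAME" for every task whose state is D
# (uninterruptible sleep), read from /proc/<pid>/status. Linux only.
list_blocked_tasks() {
    local status
    for status in /proc/[0-9]*/status; do
        [ -r "$status" ] || continue
        # A process may exit between the glob and the read, so errors are ignored
        awk '
            /^Name:/  { sub(/^Name:[ \t]*/, ""); name = $0 }
            /^Pid:/   { pid = $2 }
            /^State:/ { state = $2 }
            END       { if (state == "D") print pid, name }
        ' "$status" 2>/dev/null
    done
    return 0
}

list_blocked_tasks
```

Run while a write is blocked on a prompt, this would be expected to show the hung shell, though that has not been verified against the setup in this thread.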

Okay, I tried again, this time running each as sudo -iu ubuntu snap run --shell "...", and it also didn't work, though I expected it would: https://bucket.calder.dev/integration-test-debugging/login-shell.mp4

Lastly, I tried with sudo -iu ubuntu snap run --shell "sleep 10; ...", and that also didn't work. No nice kernel traces this time either. Nothing much new learned here, but here's the recording: https://bucket.calder.dev/integration-test-debugging/login-shell-sleep-first.mp4

So in conclusion, once one write is initiated (and not actioned by a prompt), it seems to lock up the shell and prevent other writes from being carried out. Interestingly, snapd and the prompting client are still able to respond to the kernel, but new requests aren't received by snapd from the kernel, and the shell seems to lock up. So I'm not sure what to make of this... I think I can add some more listener debug logging to make sure it's not snapd locking up, but since the shell locks up, I think it's something else.

olivercalder · 2024-10-09T22:54:47Z

What I really need to do is try on oracular, as that has the fixes for the .part file download problem.

olivercalder · 2024-10-10T04:03:04Z

I tried this on oracular, with basically the same result (though I did identify #14593 along the way 🙃).

Here's the recording: https://bucket.calder.dev/integration-test-debugging/oracular.mp4

And here's the journalctl log, with echo -n upcall > /sys/module/apparmor/parameters/debug as well for extra logging -- I annotated the important events with messages like ### ... ###, so please grep around, I didn't prune anything, for completeness, and didn't manage to collect the full logs until a while after the testing ended: https://bucket.calder.dev/integration-test-debugging/oracular-log.txt

Most interesting immediate takeaway for me is that the journal stops receiving messages for ~1min after each write blocks.

olivercalder · 2024-10-10T14:02:33Z

Just noticed something else: the system locking up only occurs when prompting is blocked on a write, not on a read. And the things which are locked up also seem to be writes (though some writes do still seem to be able to occur, such as the client writing a response to the snapd API socket, or snapd writing a response to the kernel using ioctl).

Here's the recording:

read-vs-write.mp4

olivercalder · 2024-10-10T14:27:59Z

Confirmed, I think: when I use bash instead of zsh, I can carry on doing things (e.g. echo hello) while the prompt for write is sitting open (blocking the write), but the moment I try to touch foo, that command hangs.

olivercalder · 2024-10-10T14:37:19Z

Oh, it's more interesting than that: it's that file creation in the same directory hangs, but other file creation succeeds!

In the file below, everything from echo hello onward happened after the write to /home/oac/tag.png was hanging. So I could echo hello, could read a file (in another directory), could write a file in another directory, but as soon as I tried to write a different file in the same directory, it hung. Very very interesting.

ZeyadYasser

Thank you, looks really good! Did a first pass.

Also, maybe the script files can go under a testdata directory so that they are not included in the Go package? Feel free to ignore this for now; it may make sense to do this for all tests later anyway.
https://pkg.go.dev/cmd/go#hdr-Package_lists_and_patterns

Directory and file names that begin with "." or "_" are ignored by the go tool, as are directories named "testdata".

ZeyadYasser · 2024-10-11T13:45:30Z

tests/main/apparmor-prompting-integration-tests/task.yaml

+    tests.session -u test exec prompting-client.scripted \
+        --script="${TEST_DIR}/script.json" \
+        --grace-period=1 \
+        --var="BASE_PATH:${TEST_DIR}" | tee "${TEST_DIR}/result" &


Does the prompting client keep running endlessly in the background?

How about using systemd-run to manage its life cycle instead?

The usual prompting client daemon is running in the background, yes. This may not be necessary, but if snapd restarts, it'll check for the presence of a handler-service app and restart it if it's not running. I could experiment with disabling the prompting-client.daemon service...

The scripted client will terminate when it finishes its script and waits the grace period to see if any unexpected prompts occur. The scripted client needs to run as the test user, would systemd-run work for that? I'm wary of doing anything much different than a normal user invoking prompting-client.scripted directly.

Ah I think you're right, if the test script exits early, the prompting client needs to be manually killed. I added this fix in one of the recent PRs.

Open to exploring systemd-run though...

Yes, you can easily run as any user. The added benefit is that you don't need to handle PIDs yourself; you let systemd handle that for you.

This is very similar to what you could use:

snapd/tests/main/snap-remove-terminate/task.yaml

Line 56 in f9e69aa

tests.session -u test exec systemd-run --user --unit test-kill.service flock "$lockfile" "$sh_snap_bin" -c 'touch $SNAP_USER_COMMON/alive; sleep 100000'

pedronis

Thank you for pushing this forward. I did a high-level pass and looked only at the first of the tests. There's quite a bit of timeout/sleep/retry-based code, and I'm worried about how that behaves on slow testing machines. If we cannot do better, you should consider being at least a bit more pessimistic with the numbers. It might also make sense to define the most common wait/timeout times as env vars instead of having to change them one by one.
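
A minimal sketch of the env var idea, assuming hypothetical variable names (`PROMPT_WAIT_TIMEOUT`, `PROMPT_GRACE_PERIOD`) and default values that do not come from this PR. Each default applies only when the variable is unset or empty, so a slow machine can raise the numbers in one place.

```shell
#!/bin/bash

# Hypothetical names and values; ":=" assigns the default only if the
# variable is unset or empty, so values exported by the caller win.
: "${PROMPT_WAIT_TIMEOUT:=30}"
: "${PROMPT_GRACE_PERIOD:=5}"

echo "waiting up to ${PROMPT_WAIT_TIMEOUT}s, grace period ${PROMPT_GRACE_PERIOD}s"
```

In a spread task these would more naturally live in the task's environment section, with the scripts reading them as shown.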

pedronis · 2024-10-14T09:07:11Z

tests/lib/tools/tests.session

-			--)
-				shift
-				break
-				;;


why this change? seems a bit unrelated, also @sergiocazzolato should review

The existing help text didn't actually work:

show_help() {
	echo "usage: tests.session exec [-u USER] [-p PID_FILE] [--] <CMD>"
	echo "       tests.session prepare | restore [-u USER | -u USER1,USER2,...]"
	echo "       tests.session kill-leaked"
	echo "       tests.session dump"
	echo "       tests.session has-system-systemd-and-dbus"
	echo "       tests.session has-session-systemd-and-dbus"
}

Here's the switch for the exec keyword:

exec)
	action="exec"
	shift
	# remaining arguments are the command to execute
	break
	;;

So if you try to do tests.session exec -u test -- echo foo, it'll treat everything after exec (-u test -- echo foo) as the command to execute, which is wrong.

So I saw two options:

change the behavior of exec so it actually expects -- and doesn't break immediately

remove the -- from the exec help text

I checked every usage of tests.session in spread tests, and all of them put -u USER -p PID_FILE before the exec. So I did the latter with regards to exec. And -- isn't used by any other command, so I thought it best to remove it.

Agreed, the help needs an update: the parser breaks right after the exec command, so what it expects next is the command to execute.

Another way to do this is to remove the break in exec, which would force callers to use -- to delimit the command to execute.
But this would require a change in the tests, which should be done in a separate PR.
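
The parsing behavior discussed above can be reproduced with a simplified model. This is a sketch, not the real tests.session code: `parse_args` is a hypothetical function that handles only `-u` and `exec`, and it breaks out of the option loop as soon as it sees `exec`, just like the switch quoted earlier.

```shell
#!/bin/bash

# Hypothetical, simplified model of the option loop; not the real tests.session.
parse_args() {
    local user="" action=""
    while [ $# -gt 0 ]; do
        case "$1" in
            -u)
                user="$2"
                shift 2
                ;;
            exec)
                action="exec"
                shift
                # remaining arguments are the command to execute
                break
                ;;
            *)
                echo "unknown argument: $1" >&2
                return 1
                ;;
        esac
    done
    echo "user=${user:-<none>} action=${action:-<none>} cmd=$*"
}

parse_args -u test exec echo foo
# prints: user=test action=exec cmd=echo foo
parse_args exec -u test -- echo foo
# prints: user=<none> action=exec cmd=-u test -- echo foo
```

The second call shows why options placed after `exec` never reach the parser: they are swallowed into the command.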

pedronis · 2024-10-14T09:12:43Z

tests/main/apparmor-prompting-integration-tests/task.yaml

+    echo "Run the test script as the test user"
+    if ! tests.session -u test exec sh -x "${TEST_DIR}/${VARIANT}.sh" "$TEST_DIR" ; then
+        # kill the prompting-client-scripted process for this test run
+        pkill -f "prompting-client-scripted.*${TEST_DIR}"


shouldn't we always kill it in restore per above discussion?

Yes, but would the scripted client running in the background prevent the execute phase from completing, and thus prevent restore from starting? In testing, I found that if the scripted client did not terminate during execute, the test would sit waiting until it timed out.

Old answer when I hadn't caught the "in restore" in your question:

The prompting client scripted mode will terminate successfully after it has observed all expected prompts and the grace period elapses (so it's sure no unexpected prompts occurred after the last expected one). It will terminate unsuccessfully if it sees any unexpected prompt or if it receives an error after talking to the snapd API. I believe the only case when we need to kill it explicitly is if the shell script for the test case exited early for some reason.

Probably worth expanding the comment on the pkill to mention that the client is still running because it might not have observed all the expected prompts?

tests/main/apparmor-prompting-integration-tests/task.yaml

pedronis · 2024-10-14T09:14:25Z

tests/main/apparmor-prompting-integration-tests/read_single_allow.json

+  "prompt-filter": {
+    "snap": "prompting-client",
+    "interface": "home",
+    "constraints": {
+      "path": "$BASE_PATH/.*"
+    }
+  },


what's the role of this outside of prompts?

The overall purpose of the scripted client is to test that the correct prompts occur in the correct order. This top-level filter applies first, so only prompts matching this filter are considered when checking the order and contents of the prompts. For these tests, the top-level filter is a way of isolating the different test cases, so only prompts which have a path starting with the passed-in base path are checked against the script, and that base path is mktemp-unique to the test case.
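
The isolation effect of the top-level filter can be illustrated with a small sketch. It is not the scripted client's implementation: `path_in_scope` is a hypothetical stand-in that uses a shell glob where the script uses the pattern `$BASE_PATH/.*`, and `BASE_PATH` is created here with mktemp the way the explanation above describes.

```shell
#!/bin/bash

# Each test case gets its own unique base directory
BASE_PATH="$(mktemp -d)"

# Hypothetical stand-in for the top-level filter: succeed only for paths
# underneath BASE_PATH, mirroring the "$BASE_PATH/.*" constraint.
path_in_scope() {
    case "$1" in
        "$BASE_PATH"/*) return 0 ;;
        *) return 1 ;;
    esac
}

for path in "$BASE_PATH/test.txt" "/home/test/unrelated.txt"; do
    if path_in_scope "$path"; then
        echo "checked against script: $path"
    else
        echo "ignored by filter: $path"
    fi
done
```

Because every test case has a different base directory, prompts caused by one case fall outside the filter of every other case.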

sergiocazzolato · 2024-10-15T15:48:36Z

tests/lib/tools/tests.session

@@ -1,7 +1,7 @@
 #!/bin/bash -e

 show_help() {
-	echo "usage: tests.session exec [-u USER] [-p PID_FILE] [--] <CMD>"
+	echo "usage: tests.session [-u USER] [-p PID_FILE] exec <CMD>"


We could have prepare and restore follow the same logic:
tests.session [-u USER | -u USER1,USER2,...] prepare | restore

In fact, most of the tests already use it like this.

We just need to break after the prepare/restore commands are found.

Most tests already do -u prepare and -u restore, but there are a few exceptions:

me@hostname:~/Projects/snapd$ grep -r 'tests\.session .*prepare.* -u.*'
tests/regression/lp-2065077/task.yaml: tests.session prepare -u test
tests/lib/tools/tests.session: echo " tests.session prepare | restore [-u USER | -u USER1,USER2,...]"
tests/lib/tools/suite/tests.session/task.yaml: tests.session prepare -u "$USER"
tests/main/snap-user-dir-perms-fixed/task.yaml: tests.session prepare -u "$USER"
tests/main/user-session-env/task.yaml: tests.session prepare -u "$user"
me@hostname:~/Projects/snapd$ grep -r 'tests\.session .*restore.* -u.*'
tests/regression/lp-2065077/task.yaml: tests.cleanup defer tests.session restore -u test
tests/lib/tools/tests.session: echo " tests.session prepare | restore [-u USER | -u USER1,USER2,...]"
tests/lib/tools/suite/tests.session/task.yaml: tests.session restore -u "$USER"
tests/main/snap-user-dir-perms-fixed/task.yaml: tests.session restore -u "$USER"
tests/main/user-session-env/task.yaml: tests.session restore -u "$user"

Seeing as prepare/restore work fine either way, I don't see a need to change them in this PR, but the change should be simple enough to enforce -u prepare and -u restore always.

tests/main/apparmor-prompting-integration-tests/task.yaml

Signed-off-by: Oliver Calder <[email protected]>

…uce sleeps Signed-off-by: Oliver Calder <[email protected]>

When creating a new file is blocked on a reply to a request prompt, the directory in which the file will be created is locked from other writes. Thus, we can't queue up multiple outstanding writes on files in the same directory. Instead, we must write files in different directories in order for this test to succeed. Signed-off-by: Oliver Calder <[email protected]>
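
A sketch of the layout this commit message describes, with one directory per outstanding write. The plain `echo` stands in for the write that the real test performs from inside the snap (where it would block on a prompt), and the directory names are hypothetical.

```shell
#!/bin/bash

TEST_DIR="$(mktemp -d)"

for name in test1 test2 test3; do
    # One directory per file, so a blocked create in one directory
    # does not hold up file creation for the others
    mkdir -p "${TEST_DIR}/${name}"
    # Stand-in for the write issued from inside the snap
    (echo "${name} is written" > "${TEST_DIR}/${name}/file.txt") &
done

# Wait for all background writes to finish
wait
```

With prompting active, each of these writes would stay pending in its own directory until a reply arrives, rather than the second and third hanging behind the first.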

…st dir to terminate Signed-off-by: Oliver Calder <[email protected]>

Now that canonical#14538 has landed, rules may overlap as long as their outcomes do not conflict. As such, the download_file_defaults test case is no longer expected to fail. Signed-off-by: Oliver Calder <[email protected]>

olivercalder · 2024-10-22T19:58:49Z

Test failures:

Prepare failed:

google-pro:ubuntu-fips-22.04-64:tests/fips/ --- unrelated

Execute failed:

google:ubuntu-24.04-64:tests/main/upgrade-from-release --- unrelated, fixed on master
google-core:ubuntu-core-18-64:tests/core/snapd-refresh-vs-services-reboots --- unrelated, Andrew has a pending fix
google-core:ubuntu-core-18-64:tests/core/snapd-refresh-vs-services:start_w_2_49_2 --- unrelated, Andrew has a pending fix
google-core:ubuntu-core-18-64:tests/core/snapd-refresh-vs-services:start_w_pr --- unrelated, Andrew has a pending fix
google-core:ubuntu-core-18-64:tests/core/snapd-refresh-vs-services:start_w_stable --- unrelated, Andrew has a pending fix
google-core:ubuntu-core-24-64:tests/core/mem-cgroup-disabled --- unrelated

Restore failed:

google-distro-1:fedora-39-64:tests/main/component-from-store --- unrelated, fixed on master
openstack:fedora-40-64:tests/main/component-from-store --- unrelated, fixed on master
google-core:ubuntu-core-18-64:tests/core/snapd-refresh-vs-services-reboots --- unrelated, Andrew has pending fix
google-core:ubuntu-core-24-64:tests/core/mem-cgroup-disabled --- unrelated

Plus all of the opensuse tests, none of which are related to this PR: https://github.com/canonical/snapd/actions/runs/11450399848/job/31907785826?pr=14518

This PR adds a new spread test and does not affect production code, so it's clear the other test failures are unrelated to the changes here. The changes to tests.session were verified to not affect other spread tests, as those tests already use the behavior we're now enforcing (and the other behavior was broken).

olivercalder force-pushed the prompting-add-integration-spread-tests branch from 97ec76b to 604508d Compare September 18, 2024 23:28

ernestl added this to the 2.67 milestone Sep 19, 2024

olivercalder requested review from ZeyadYasser and maykathm September 24, 2024 21:27

olivercalder marked this pull request as ready for review September 24, 2024 21:27

maykathm approved these changes Sep 27, 2024

olivercalder force-pushed the prompting-add-integration-spread-tests branch from 464fd3a to e14860e Compare October 10, 2024 20:28

ZeyadYasser reviewed Oct 11, 2024

olivercalder force-pushed the prompting-add-integration-spread-tests branch from e14860e to 37506e7 Compare October 12, 2024 01:32

olivercalder requested a review from pedronis October 14, 2024 05:02

pedronis reviewed Oct 14, 2024

olivercalder requested a review from sergiocazzolato October 15, 2024 15:18

sergiocazzolato reviewed Oct 15, 2024
