Fix deadlock with tap/tun devices for arm guests #1232

AndrewFasano · 2022-10-20T15:25:35Z

No description provided.

A guest boot hangs while probing the network interface when iommu_platform=on is used. The following qemu cli hangs without this patch: # $QEMU \ -netdev tap,fd=3,id=hostnet0,vhost=on,vhostfd=4 3<>/dev/tap67 4<>/dev/host-net \ -device virtio-net-pci,netdev=hostnet0,id=net0,iommu_platform=on,disable-legacy=on \ ... Commit: c471ad0 (vhost_net: device IOTLB support) took care of setting vhostfd to non-blocking when QEMU opens /dev/host-net but if the fd is passed from qemu cli then we need to ensure that fd is set to non-blocking. Fixes: c471ad0 ("vhost_net: device IOTLB support") Cc: [email protected] Cc: Michael S. Tsirkin <[email protected]> Cc: Jason Wang <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Jason Wang <[email protected]>

When QEMU sets up a tap based network device backend, it mostly ignores errors reported from various ioctl() calls it makes, assuming the TAP file descriptor is valid. This assumption can easily be violated when the user is passing in a pre-opened file descriptor. At best, the ioctls may fail with a -EBADF, but if the user passes in a bogus FD number that happens to clash with a FD number that QEMU has opened internally for another reason, a wide variety of errnos may result, as the TUNGETIFF ioctl number may map to a completely different command on a different type of file. By ignoring all these errors, QEMU sets up a zombie network backend that will never pass any data. Even worse, when QEMU shuts down, or that network backend is hot-removed, it will close this bogus file descriptor, which could belong to another QEMU device backend. There's no obvious guaranteed reliable way to detect that a FD genuinely is a TAP device, as opposed to a UNIX socket, or pipe, or something else. Checking the errno from probing vnet hdr flag though, does catch the big common cases. ie calling TUNGETIFF will return EBADF for an invalid FD, and ENOTTY when FD is a UNIX socket, or pipe which catches accidental collisions with FDs used for stdio, or monitor socket. Previously the example below where bogus fd 9 collides with the FD used for the chardev saw: $ ./x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hostnet0,fd=9 \ -chardev socket,id=charchannel0,path=/tmp/qga,server,nowait \ -monitor stdio -vnc :0 qemu-system-x86_64: -netdev tap,id=hostnet0,fd=9: TUNGETIFF ioctl() failed: Inappropriate ioctl for device TUNSETOFFLOAD ioctl() failed: Bad address QEMU 2.9.1 monitor - type 'help' for more information (qemu) Warning: netdev hostnet0 has no peer which gives a running QEMU with a zombie network backend. With this change applied we get an error message and QEMU immediately exits before carrying on and making a bigger disaster: $ ./x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hostnet0,fd=9 \ -chardev socket,id=charchannel0,path=/tmp/qga,server,nowait \ -monitor stdio -vnc :0 qemu-system-x86_64: -netdev tap,id=hostnet0,vhost=on,fd=9: Unable to query TUNGETIFF on FD 9: Inappropriate ioctl for device Reported-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Daniel P. Berrange <[email protected]> Tested-by: Dr. David Alan Gilbert <[email protected]> Message-id: [email protected] [lv: to simplify, don't check on EINVAL with TUNGETIFF as it exists since v2.6.27] Signed-off-by: Laurent Vivier <[email protected]> Signed-off-by: Jason Wang <[email protected]>

When using C11 atomics, non-seqcst reads and writes do not participate in the total order of seqcst operations. In util/async.c and util/aio-posix.c, in particular, the pattern that we use write ctx->notify_me write bh->scheduled read bh->scheduled read ctx->notify_me if !bh->scheduled, sleep if ctx->notify_me, notify needs to use seqcst operations for both the write and the read. In general this is something that we do not want, because there can be many sources that are polled in addition to bottom halves. The alternative is to place a seqcst memory barrier between the write and the read. This also comes with a disadvantage, in that the memory barrier is implicit on strongly-ordered architectures and it wastes a few dozen clock cycles. Fortunately, ctx->notify_me is never written concurrently by two threads, so we can assert that and relax the writes to ctx->notify_me. The resulting solution works and performs well on both aarch64 and x86. Note that the atomic_set/atomic_read combination is not an atomic read-modify-write, and therefore it is even weaker than C11 ATOMIC_RELAXED; on x86, ATOMIC_RELAXED compiles to a locked operation. Analyzed-by: Ying Fang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Tested-by: Ying Fang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Stefan Hajnoczi <[email protected]>

These iothread mutex locks/unlocks were dropped somewhere between qemu 2.9.1 and our current codebase. Assuming they don't break record/replay, they probalby should still be here. Without these, PANDA would deadlock about 75% of the time with the vexpress-a9 ARM guest machine when using -netdev tap and virtio networking.

zestrada · 2022-10-20T15:40:09Z

Can confirm this works for the target where we first saw the bug. There appears to be an issue with the barebox minimal test case (with or without tap device):

/out/panda-taphang/build/arm-softmmu/panda-system-arm -M vexpress-a9 -nographic -no-reboot -monitor telnet::4443,server,nowait -kernel ./images/barebox-vexpress-ca9.img -m 1024m             
audio: Could not init `oss' audio driver                                         

barebox 2022.10.0 #1 Wed Oct 19 23:03:50 UTC 2022


Board: V2P-CA9
panda-system-arm: /out/panda-taphang/memory.c:918: memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()' failed.
Aborted (core dumped)

…ng and unlocking fns

zestrada · 2022-10-21T13:32:54Z

Just a quick comment in case others run into this issue. The systems that were deadlocking were Linux kernels that somehow entered a race condition in QEMU during guest execution of the calibrate_delay kernel function. A hack to get around this has been to specify a loops per jiffy from a successful boot (e.g., without a tap device) on the kernel command line (e.g., lpj=43648). Your mileage may vary.

codomania and others added 4 commits October 20, 2022 08:33

Restore more iothread mutex locks/unlocks, including asserts in locki…

b5aad15

…ng and unlocking fns

zestrada mentioned this pull request Nov 2, 2022

PANDA hangs using network tap with QEMU vexpress-a9 ARM board #1244

Open

zestrada linked an issue Nov 2, 2022 that may be closed by this pull request

PANDA hangs using network tap with QEMU vexpress-a9 ARM board #1244

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlock with tap/tun devices for arm guests #1232

Fix deadlock with tap/tun devices for arm guests #1232

AndrewFasano commented Oct 20, 2022

zestrada commented Oct 20, 2022

zestrada commented Oct 21, 2022

Fix deadlock with tap/tun devices for arm guests #1232

Are you sure you want to change the base?

Fix deadlock with tap/tun devices for arm guests #1232

Conversation

AndrewFasano commented Oct 20, 2022

zestrada commented Oct 20, 2022

zestrada commented Oct 21, 2022