virtiofsd deadlocks when qemu is stopped while debugging.
I've been implementing virtio-fs support in a custom kernel. While debugging it in qemu I've noticed that virtiofsd can deadlock on vu_dispatch_rwlock, when interrupting qemu at the wrong moment.
The problem is the following:
- guest wants to write some data, sends fuse request to virtiofsd.
- virtiofsd worker thread is paused on write()
- host breaks qemu with a debugger, qemu pauses all devices
- virtiofsd receives VU event to stop qeueues, locks vu_dispatch_rwlock, joins all worker threads.
- Worker thread finishes write()ing and resumes, wants to send reply, deadlocks on vu_dispatch_rwlock.
Simply lifting the lock in fv_queue_cleanup_thread() does not work, since qemu crashes when it gets the (unexpected) reply. Even if it did not deadlock, what happens to the request currently being processed by the worker? It is already removed from the virtq and should be answered when the queues are resumed?
Steps to reproduce:
- clone newest qemu from virtio-fs and compile virtiofsd
git clone git@gitlab.com:virtio-fs/qemu.git
cd qemu
git checkout virtio-fs-dev
./configure
make virtiofsd
-
use virt-manager to create fedora server 31 (or any other distro supporting virtiofs).
-
run virtiofsd, enable debug output to see whats going on. Not necessary for the deadlock though.
mkdir testdir
./virtiofsd --socket-path=/tmp/vhostqemu -o source=$(pwd)/testdir -o log_level=debug
- to get minimal example, stop vm and run with qemu 4.2.0 using gdb socket on tcp::1234
sudo qemu-system-x86_64 -hda /var/lib/libvirt/images/fedora31.qcow2 -smp 10 -m 2048 -enable-kvm -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs -object memory-backend-file,id=mem,size=2G,mem-path=/dev/shm,share=on -numa node,memdev=mem -gdb tcp::1234
- in vm: mount virtio-fs and run dd. Force O_SYNC so write() is slow and the bug much more apparent. Still triggers without though, just more rarely.
mount -t virtiofs myfs /mnt
dd if=/dev/zero of=/mnt/testfile oflag=sync
- run poor-mans-profiler on host
for x in $(seq 1 1000); do
gdb -ex "target remote :1234" -ex "set pagination 0" -ex "thread apply all bt" -batch 2> /dev/null
sleep 0.01
done
- qemu system freezes, only way out is to kill virtiofsd on host:
sudo killall -9 virtiofsd
Debug output
Debug output on normal operation:
[6784269638608] [ID: 00031101] unique: 542762, opcode: WRITE (16), nodeid: 2, insize: 80, pid: 964
[6784269648727] [ID: 00031101] lo_write_buf(ino=2, size=512, off=138943488)
[6784269670840] [ID: 00031101] unique: 542762, success, outsize: 24
[6784269674537] [ID: 00031101] virtio_send_msg: elem 0: with 2 in desc of length 24
[6784269720434] [ID: 00031096] fv_queue_thread: Got queue event on Queue 1
[6784269725223] [ID: 00031096] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 24 out: 592
[6784269730694] [ID: 00031096] fv_queue_thread: Waiting for Queue 1 event
[6784269736785] [ID: 00031103] fv_queue_worker: elem 0: with 3 out desc of length 592 bad_in_num=0 bad_out_num=0
[6784269742767] [ID: 00031103] fv_queue_worker: Write special case
[6784269744871] [ID: 00031103] unique: 542764, opcode: WRITE (16), nodeid: 2, insize: 80, pid: 964
[6784269747886] [ID: 00031103] lo_write_buf(ino=2, size=512, off=138944000)
[6784269758146] [ID: 00031103] unique: 542764, success, outsize: 24
[6784269760210] [ID: 00031103] virtio_send_msg: elem 0: with 2 in desc of length 24
[6784271185518] [ID: 00031096] fv_queue_thread: Got queue event on Queue 1
[6784271223360] [ID: 00031096] fv_queue_thread: Queue 1 gave evalue: 1 available: in: 24 out: 592
[6784271249450] [ID: 00031096] fv_queue_thread: Waiting for Queue 1 event
[6784271257174] [ID: 00031109] fv_queue_worker: elem 0: with 3 out desc of length 592 bad_in_num=0 bad_out_num=0
[6784271279617] [ID: 00031109] fv_queue_worker: Write special case
debug output just before it deadlocks:
[6784271283034] [ID: 00031109] unique: 542766, opcode: WRITE (16), nodeid: 2, insize: 80, pid: 964
[6784271289596] [ID: 00031109] lo_write_buf(ino=2, size=512, off=138944512)
[6784271649352] [ID: 00000001] virtio_loop: Got VU event
[6784271668058] [ID: 00000001] fv_queue_set_started: qidx=0 started=0
[6784271683918] [ID: 00031093] fv_queue_thread: kill event on queue 0 - quitting
[6784271904769] [ID: 00000001] fv_remove_watch: TODO! fd=9
[6784271942872] [ID: 00000001] virtio_loop: Waiting for VU event
[6784271972869] [ID: 00000001] virtio_loop: Got VU event
[6784271991565] [ID: 00000001] fv_queue_set_started: qidx=1 started=0
[6784272003447] [ID: 00031096] fv_queue_thread: kill event on queue 1 - quitting
[6784275720335] [ID: 00031109] unique: 542766, success, outsize: 24
[6784275731135] [ID: 00031109] virtio_send_msg: elem 0: with 2 in desc of length 24