watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
Host environment
- Operating system: Proxmox VE 6.4
- OS/kernel version: pve-manager/6.4-13/9f411e79 (running kernel: 5.4.157-1-pve)
- Architecture: x86_64
- QEMU flavor: pve-qemu-kvm 5.2.0-6
- QEMU version: pve-qemu-kvm_5.2
- QEMU command line:
/usr/bin/kvm -id 116 -name iotest1 -no-shutdown -chardev socket,id=qmp,path=/var/run/qemu-server/116.qmp,server,nowait -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5 -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/116.pid -daemonize -smbios type=1,uuid=77eb2d4b-2a90-4dc2-a8ae-c3ad61ffa719 -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vnc unix:/var/run/qemu-server/116.vnc,password -cpu kvm64,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+lahf_lm,+sep -m 1024 -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device vmgenid,guid=b41dc047-4286-4737-a090-fbb0aca732f4 -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -device VGA,id=vga,bus=pci.0,addr=0x2 -chardev socket,path=/var/run/qemu-server/116.qga,server,nowait,id=qga0 -device virtio-serial,id=qga0,bus=pci.0,addr=0x8 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -iscsi initiator-name=iqn.1993-08.org.debian:01:774364916734 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/hddpool/vms-files2-lz4/images/116/vm-116-disk-0.qcow2,if=none,id=drive-scsi0,cache=writeback,format=qcow2,aio=threads,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 -netdev type=tap,id=net0,ifname=tap116i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=A2:9E:B2:61:C2:F2,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102 -machine type=pc+pve0
Emulated/Virtualized environment
- Operating system: Debian 10.10
- OS/kernel version: 4.19.0-17-amd64
- Architecture: x86_64
Description of problem
During a live move/migration of virtual disks, VMs experience severe stuttering and even CPU soft lockups, as described here:
https://bugzilla.kernel.org/show_bug.cgi?id=199727
This also happens on some of our virtual machines when the I/O load inside the VM is high or the workload is fsync-centric.
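As an illustration of what I mean by fsync-centric, a guest-side job along the following lines forces an fsync after every 4 KiB write (assuming fio is available in the guest; the exact parameters are arbitrary and only meant as a sketch):

fio --name=fsynctest --ioengine=sync --rw=write --bs=4k --size=256m --fsync=1 --runtime=120 --time_based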
I am searching for a way to mitigate this problem: I can live with stuttering/delays of several seconds, but CPU soft lockups of 22 s or more are unacceptable.
I have searched the web for a long time now, but found neither a solution nor a way to troubleshoot this in more depth and find the real root cause.
If this issue report is not accepted because it concerns a "non-native" QEMU build (i.e. the Proxmox platform), please tell me which QEMU/distro I can/should use instead (one with an easily usable live migration feature) to try to reproduce the problem; a rough QMP-level sketch of the disk move is included at the end of this report.
Steps to reproduce
- Do a live migration (move) of one or more virtual machine disks.
- Watch disk latency with "ioping -WWWYy test.dat" inside the VM whose disk is being moved.
- Disk latency varies heavily; from time to time it goes up to values of tens of seconds, even leading to kernel messages like "kernel:[ 2155.520846] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]":
4 KiB >>> test.dat (ext4 /dev/sda1): request=55 time=1.07 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=56 time=1.24 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=57 time=567.4 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=58 time=779.0 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=59 time=589.0 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=60 time=1.57 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=61 time=847.7 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=62 time=933.0 ms
4 KiB >>> test.dat (ext4 /dev/sda1): request=63 time=891.4 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=64 time=820.8 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=65 time=1.02 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=66 time=2.44 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=67 time=620.7 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=68 time=1.03 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=69 time=1.24 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=70 time=1.42 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=71 time=1.36 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=72 time=1.41 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=73 time=1.33 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=74 time=2.36 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=75 time=1.46 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=76 time=1.45 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=77 time=1.28 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=78 time=1.41 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=79 time=2.33 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=80 time=1.39 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=81 time=1.35 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=82 time=1.54 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=83 time=1.52 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=84 time=1.50 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=85 time=2.00 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=86 time=1.47 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=87 time=1.26 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=88 time=1.29 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=89 time=2.05 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=90 time=1.44 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=91 time=1.43 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=92 time=1.72 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=93 time=1.77 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=94 time=2.56 s
Message from syslogd@iotest2 at Jan 14 14:51:12 ...
kernel:[ 2155.520846] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
4 KiB >>> test.dat (ext4 /dev/sda1): request=95 time=22.5 s (slow)
4 KiB >>> test.dat (ext4 /dev/sda1): request=96 time=3.56 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=97 time=1.52 s (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=98 time=1.69 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=99 time=1.90 s
4 KiB >>> test.dat (ext4 /dev/sda1): request=100 time=1.15 s (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=101 time=890.0 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=102 time=959.6 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=103 time=926.5 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=104 time=791.5 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=105 time=577.8 ms (fast)
4 KiB >>> test.dat (ext4 /dev/sda1): request=106 time=867.7 ms (fast)
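Additional information
As far as I understand, the Proxmox "move disk" operation is built on QEMU's drive-mirror block job with a pivot at the end. If that is correct, the problem should be reproducible on plain QEMU by driving the same block job over QMP; below is a rough sketch of what I assume happens under the hood (the socket path and target path are placeholders, and I have not verified this against the Proxmox code):

# attach to the QMP socket of the test VM (path is a placeholder)
socat - UNIX-CONNECT:/path/to/qmp.sock

{ "execute": "qmp_capabilities" }
{ "execute": "drive-mirror", "arguments": { "device": "drive-scsi0", "target": "/other/storage/vm-116-disk-0.qcow2", "format": "qcow2", "sync": "full" } }
# wait for the BLOCK_JOB_READY event, then pivot to the new image:
{ "execute": "block-job-complete", "arguments": { "device": "drive-scsi0" } }

The "drive-scsi0" device id matches the -drive id in the QEMU command line above.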