iSCSI causes memory corruption
Host environment
- Operating system: Proxmox v7.3-3
- OS/kernel version: Linux hv-chi 5.15.74-1-pve #1 SMP PVE 5.15.74-1 (Mon, 14 Nov 2022 20:17:15 +0100) x86_64 GNU/Linux
- Architecture: x86_64 on AMD
- QEMU flavor: qemu-system-x86_64
- QEMU version (kvm --version): QEMU emulator version 7.1.0 (pve-qemu-kvm_7.1.0-4)
- QEMU command line:
# How iSCSI is attached; any other options seem irrelevant
/usr/bin/kvm \
(...)
-iscsi initiator-name=iqn.1993-08.org.debian:01:6189a1634d5 \
-device virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0 \
-drive file=iscsi://xxxxxx/iqn.2022-12.xx.ne0.xxxxx:iscsi/0,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=io_uring,detect-zeroes=unmap \
-device scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1 \
-machine type=q35+pve0
The iSCSI connection runs over a local bridge on the host between two VirtIO interfaces, so network problems can be excluded. The host uses ECC memory, and it logs neither memory errors nor MCEs.
Emulated/Virtualized environment
- Operating system: Linux, multiple flavors (Debian, Home Assistant, pure Debian 11)
- OS/kernel version: varies
- Architecture: x86_64
Description of problem
This is a compound problem that most likely combines two things: the way TrueNAS SCALE handles iSCSI, which triggers the problem, and a memory-handling issue in QEMU, which leads to the crash. In short, any Linux guest whose disk QEMU attaches directly over iSCSI crashes hard within 30 s to 1 h. I was able to find a pattern in the logs:
- First, a message like the following is logged:
  QEMU[53139]: kvm: iSCSI Busy/TaskSetFull/TimeOut (retry #1 in 0 ms): TASK_SET_FULL
  - it is always TASK_SET_FULL
  - it is always retry #1 in ... ms, where only the number of milliseconds varies
  - the line is repeated multiple times, sometimes 5x and sometimes >200x
- It is followed by a single line with one of the following:
  - double free or corruption (out)
  - double free or corruption (!prev)
  - kvm: ../block/block-backend.c:1567: blk_aio_write_entry: Assertion `!qiov || qiov->size == acb->bytes' failed.
  - kvm: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
  - kvm: iSCSI CheckCondition: SENSE KEY:UNIT_ATTENTION(6) ASCQ:BUS_RESET(0x2900)
  - malloc(): invalid size (unsorted)
- The virtual machine crashes
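The log pattern described above can be checked mechanically. The following is a minimal sketch, assuming the QEMU messages are available as plain text on standard input (for example, exported from the system journal). The function name summarise_iscsi_log is made up for illustration, and the crash signatures are limited to the heap-corruption and assertion messages listed above.

```shell
#!/bin/sh
# Sketch only: count consecutive TASK_SET_FULL retry lines and report how many
# preceded each memory-corruption or assertion message.
# summarise_iscsi_log is a hypothetical helper; it reads log text from stdin.
summarise_iscsi_log() {
    awk '
        # A retry line from the QEMU iSCSI driver: count it and move on.
        /iSCSI Busy\/TaskSetFull\/TimeOut/ && /TASK_SET_FULL/ { retries++; next }

        # One of the crash signatures quoted in this report.
        /double free or corruption|malloc\(\): invalid size|Assertion .* failed/ {
            printf "%d TASK_SET_FULL retries before: %s\n", retries, $0
            retries = 0
        }
    '
}
```

One possible use, assuming the host logs to the systemd journal, is `journalctl -b | summarise_iscsi_log`. Each output line then pairs a crash message with the number of retry lines seen since the previous crash message.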
Steps to reproduce
I don't have specific, concrete steps, only clues. The problem started after TrueNAS SCALE updated its iSCSI code to a new upstream version in the Bluefin release. That iSCSI server still works when the kernel mounts the iSCSI target and QEMU uses a normal /dev entry. While the server probably has a problem of its own, QEMU probably should not crash with memory errors.
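For comparison, the working setup (kernel initiator plus a normal /dev entry) might look roughly like the sketch below. It only prints the steps and executes nothing. The helper name kernel_iscsi_plan, the portal address, and the IQN are placeholders rather than values from this report, and the by-path device name assumes open-iscsi with standard udev rules and the default iSCSI port 3260.

```shell
#!/bin/sh
# Sketch only. Prints the steps for attaching the LUN through the kernel
# initiator (open-iscsi) instead of QEMU's built-in iSCSI driver.
# kernel_iscsi_plan is a hypothetical helper; its arguments are placeholders.
kernel_iscsi_plan() {
    portal=$1   # target IP address or host name
    iqn=$2      # target IQN
    lun=$3      # LUN number

    # 1. Discover the target and log in with the kernel initiator.
    printf 'iscsiadm -m discovery -t sendtargets -p %s\n' "$portal"
    printf 'iscsiadm -m node -T %s -p %s --login\n' "$iqn" "$portal"

    # 2. Assumption: udev exposes the disk under a by-path name on port 3260.
    dev="/dev/disk/by-path/ip-$portal:3260-iscsi-$iqn-lun-$lun"

    # 3. QEMU then opens the block device in place of the iscsi:// URL.
    printf '%s\n' "-drive file=$dev,if=none,id=drive-scsi0,format=raw,cache=none"
}
```

Calling `kernel_iscsi_plan 192.0.2.10 iqn.2022-12.example:target0 0` prints two iscsiadm commands and one -drive argument. The remaining drive options from the command line above (aio, discard, detect-zeroes) can be appended unchanged.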
Additional information
While I'm a software developer, I don't code in C on a daily basis. However, judging by the errors, I suspect the problem is somewhere in iscsi_co_generic_cb(): a struct seems to get damaged (an out-of-bounds write?), which causes a failure somewhere down the line.