
NVMe SQ processing gets stuck when IO queue size is small (for example 4)

Host environment

  • Operating system: Fedora Linux 37 (Workstation Edition)
  • OS/kernel version: Linux fedora-mbpro 6.4.9-100.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Aug 8 21:21:25 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Architecture: x86_64
  • QEMU flavor: qemu-system-x86_64
  • QEMU version: QEMU emulator version 9.0.50 (v9.0.0-1388-g80e8f06021-dirty)
  • QEMU command line:
./qemu-system-x86_64 -m 2G -smp 1 -vnc :1 -gdb tcp::1234,server,nowait \
  -device nvme,serial=deadbeef,drive=nvm,bootindex=0 \
  -drive file=./build/last/usr.img,if=none,id=nvm,cache=none,aio=native \
  -no-reboot -no-shutdown -device virtio-rng-pci -enable-kvm \
  -cpu host,+x2apic -chardev stdio,mux=on,id=stdio,signal=off \
  -mon chardev=stdio,mode=readline -device isa-serial,chardev=stdio

Emulated/Virtualized environment

  • Operating system: OSv
  • OS/kernel version: OSv v0.57.0-194-g0e5f067c
  • Architecture: x86_64

Description of the problem

The guest OSv gets "stuck" most of the time (but not always) when reading from or writing to the NVMe block device. It only happens when the SQ and CQ are "pathologically" small (>= 3 and <= 8 entries), which forces both the guest and the QEMU-emulated device to handle various edge conditions, for example almost-full queues. The problem does not occur if the IO queues are large enough (>= 16).
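
For context, a quick sketch of the circular-queue arithmetic involved (my own illustration, assuming the usual NVMe convention that a full queue holds one entry fewer than its size; the nvme_sq_empty check quoted further below follows the same convention):

#include <stdbool.h>
#include <stdint.h>

/* A queue of size 4 can hold at most 3 outstanding entries, so with such
 * small queues only a couple of in-flight commands are enough to hit the
 * "almost full" and "full" edge cases on both sides. */
static inline bool sq_empty(uint16_t head, uint16_t tail)
{
    return head == tail;
}

static inline bool sq_full(uint16_t head, uint16_t tail, uint16_t size)
{
    return (uint16_t)((tail + 1) % size) == head;
}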

When OSv is run with a simple "Hello World" example, which normally executes the app and terminates gracefully, it instead typically hangs like this:

OSv v0.57.0-194-g0e5f067c
... (hangs at this point)

or

OSv v0.57.0-194-g0e5f067c
Booted up in 205.22 ms
Cmdline: /hello
Hello from C code
... (hangs at this point, normally should terminate)

After connecting to the guest with gdb and inspecting the state of the NVMe driver, it looks like the driver is stuck after placing the last entry onto the IO submission queue (there is only one SQ/CQ pair, since there is a single vCPU), advancing the SQ tail and writing to the SQ doorbell register. It appears to be waiting for the interrupt that would signal a new CQ entry completing that last command, but the interrupt never arrives.
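
To make that concrete, here is a rough sketch of the submission path the driver appears to be stuck in (hypothetical names, not the actual OSv code; the doorbell offsets assume a doorbell stride of 4, i.e. CAP.DSTRD = 0, which is why SQ 1's tail doorbell shows up as MMIO writes to 0x1008 and CQ 1's head doorbell as 0x100c in the trace below):

#include <stdint.h>

#define NVME_DB_BASE   0x1000u
#define NVME_DB_STRIDE 4u   /* 4 << CAP.DSTRD, assuming CAP.DSTRD == 0 */

struct nvme_sqe { uint8_t bytes[64]; };   /* 64-byte submission queue entry */

struct nvme_sq {
    volatile uint8_t *bar0;    /* mapped controller registers (BAR0) */
    struct nvme_sqe  *ring;    /* host memory backing the SQ         */
    uint16_t qid, size, tail;
};

void nvme_submit(struct nvme_sq *sq, const struct nvme_sqe *cmd)
{
    /* 1. place the command into the next free SQ slot */
    sq->ring[sq->tail] = *cmd;

    /* 2. advance the SQ tail (wrapping around the small ring) */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->size);

    /* 3. ring the SQ tail doorbell: SQyTDBL = 0x1000 + (2 * y) * stride */
    volatile uint32_t *sq_tail_db = (volatile uint32_t *)
        (sq->bar0 + NVME_DB_BASE + (2u * sq->qid) * NVME_DB_STRIDE);
    *sq_tail_db = sq->tail;

    /* 4. sleep until the MSI-X interrupt signals a new CQ entry -- this is
     *    where the driver hangs, because the completion never arrives */
    /* wait_for_completion_irq(sq); */
}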

As mentioned above, it happens with small IO queues, which I deliberately chose in order to test how the new OSv NVMe driver deals with various edge conditions like an almost-full SQ. I did find and fix some bugs on the OSv side, but this particular problem seems to be caused by a bug on the QEMU side. I may be wrong, though.

After spending hours trying to find another bug on the OSv side, I enabled tracing in QEMU and added some extra logging to the QEMU NVMe code, which gave me better insight into what happens there, and I think I have discovered an issue. Here is a portion of the trace log (attached in full) that should illustrate the problem (the SQ/CQ size is 4):

pci_nvme_mmio_doorbell_sq sqid 1 new_tail 1
pci_nvme_inc_sq_head new_head 1
nvme_process_sq sq->req_list is empty 1
pci_nvme_io_cmd cid 4 nsid 0x1 sqid 1 opc 0x1 opname 'NVME_NVM_CMD_WRITE'
pci_nvme_write cid 4 opname 'NVME_NVM_CMD_WRITE' nsid 1 nlb 2 count 1024 lba 0x67007
pci_nvme_map_prp trans_len 1024 len 1024 prp1 0x12c7000 prp2 0x0 num_prps 1
pci_nvme_map_addr addr 0x12c7000 len 1024
pci_nvme_mmio_write addr 0x100c data 0x1 size 4
pci_nvme_mmio_doorbell_cq cqid 1 new_head 1
pci_nvme_mmio_write addr 0x1008 data 0x2 size 4

pci_nvme_mmio_doorbell_sq sqid 1 new_tail 2
pci_nvme_process_sq sq (0) OR sq->req_list (1) is empty?
pci_nvme_rw_cb cid 2 blk 'nvm'
pci_nvme_rw_complete_cb cid 2 blk 'nvm'
pci_nvme_enqueue_req_completion cid 2 cqid 1 dw0 0x0 dw1 0x0 status 0x0
nvme_post_cqes sq->req_list is empty 0
pci_nvme_irq_msix raising MSI-X IRQ vector 1
pci_nvme_mmio_write addr 0x100c data 0x2 size 4
pci_nvme_mmio_doorbell_cq cqid 1 new_head 2
pci_nvme_mmio_write addr 0x1008 data 0x3 size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 3

As you can see, QEMU receives an SQ doorbell and sets the SQ tail to 1. It then calls nvme_process_sq, which removes one request from sq->req_list, leaving it empty (the extra log line 'nvme_process_sq sq->req_list is empty 1' confirms it), executes the operation (pci_nvme_io_cmd) and schedules writing a completion entry by calling nvme_enqueue_req_completion().

Meanwhile (after the break in the snippet), a new SQ doorbell arrives (new tail is 2) and nvme_process_sq gets called again. But this time sq->req_list is empty (see the line 'pci_nvme_process_sq sq (0) OR sq->req_list (1) is empty?'), which makes the while loop skip completely:

while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {

Normally, nvme_post_cqes (scheduled by nvme_enqueue_req_completion()) would return an entry to sq->req_list before nvme_process_sq runs. In this race condition that does not happen, so the last command on the SQ is never executed and the guest never receives the corresponding CQ entry.
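
To make the interleaving easier to follow, here is a deliberately simplified, self-contained model of the race (this is not QEMU code: the free-request bookkeeping is collapsed into a counter and shrunk to a single entry so that the observed 'req_list is empty' state is reached immediately):

#include <stdbool.h>
#include <stdio.h>

static int sq_head, sq_tail;   /* guest-visible submission queue indices */
static int free_reqs = 1;      /* stands in for sq->req_list             */

static bool sq_empty(void) { return sq_head == sq_tail; }

/* Mirrors the loop condition quoted above: progress requires BOTH a
 * non-empty SQ AND a non-empty req_list. */
static void process_sq(void)
{
    while (!(sq_empty() || free_reqs == 0)) {
        free_reqs--;                         /* take a request off req_list */
        sq_head++;                           /* consume one SQE             */
        printf("executing SQE at head %d\n", sq_head);
    }
}

/* Completion path: the request goes back onto req_list, but nothing
 * re-runs process_sq() for a doorbell that was already handled. */
static void post_cqes(void)
{
    free_reqs++;
}

int main(void)
{
    sq_tail = 1; process_sq();   /* doorbell, tail = 1: SQE 1 executed        */
    sq_tail = 2; process_sq();   /* doorbell, tail = 2: req_list empty, no-op */
    post_cqes();                 /* completion arrives only afterwards        */
    printf("head=%d tail=%d -> the second SQE is stuck\n", sq_head, sq_tail);
    return 0;
}

Compiled and run, this model executes only the first SQE and ends with head=1, tail=2: the second command stays on the queue forever unless something re-runs the SQ processing after the request is returned.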

I have tried enlarging the initial size of sq->req_list and sq->io_req by 1 or 2, and that does seem to make the problem "almost" go away (it becomes much rarer):

--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4626,11 +4627,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     sq->size = size;
     sq->cqid = cqid;
     sq->head = sq->tail = 0;
-    sq->io_req = g_new0(NvmeRequest, sq->size);
+    sq->io_req = g_new0(NvmeRequest, sq->size + 2);
 
     QTAILQ_INIT(&sq->req_list);
     QTAILQ_INIT(&sq->out_req_list);
-    for (i = 0; i < sq->size; i++) {
+    for (i = 0; i < sq->size + 2; i++) {
         sq->io_req[i].sq = sq;
         QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
     }

In addition, I cannot reproduce the problem with the lowest IO queue size allowed per the spec - 2.

Steps to reproduce

  1. Get the OSv repo with the NVMe driver and build OSv with the 'Hello World' example:
git clone https://github.com/wkozaczuk/osv.git
cd osv
git checkout nvme_refined
git submodule update --init --recursive 
./scripts/setup.py
./scripts/build image=native-example fs=zfs -j$(nproc)
  2. Run OSv with NVMe enabled and point it at your build of QEMU with tracing enabled:
./scripts/run.py --qemu-path /home/wkozaczuk/projects/qemu/build/qemu-system-x86_64 --nics=0 --nvme -c 1  --pass-arg "--trace pci_nvme_*"

Additional information

I am attaching both the full QEMU log with NVMe tracing enabled and a diff of my changes to the QEMU code that add the extra logging.
