xen/pt: Incorrect register mask for PCI passthrough prevents Linux guest from completing boot process

Host environment

Operating system: Debian unstable
OS/kernel version: Linux debian 5.17.0-3-amd64 #1 SMP PREEMPT Debian 5.17.11-1 (2022-05-26) x86_64 GNU/Linux - this kernel is enabled for and running as a Xen Dom0 privileged domain on the Xen hypervisor
Architecture: x86_64, Xen hypervisor version 4.16.1
QEMU flavor: qemu-system-i386 with Xen accelerator, Qemu running in the Xen Dom0 privileged domain as the device model for a Xen HVM domain
QEMU version: 7.0.0
QEMU command line: /usr/bin/qemu-system-i386 -xen-domid 1 -no-shutdown -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-1,server,nowait -mon chardev=libxl-cmd,mode=control -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-1,server,nowait -mon chardev=libxenstat-cmd,mode=control -nodefaults -no-user-config -name bullseye -vnc none -display none -serial pty -boot order=c -smp 4,maxcpus=4 -device e1000,id=nic0,netdev=net0 -netdev type=tap,id=net0,ifname=vif1.0-emu,script=no,downscript=no -machine xenfv,igd-passthru=on -m 3072 -drive file=/dev/systems/linux,if=ide,index=0,media=disk,format=raw,cache=writeback
Sorry about the long command line. If requested, I will try to provide a shorter command line that still exhibits the bug. I do not think it is possible to reproduce this bug without the xl toolstack, which is required to configure PCI passthrough to Xen HVM guests. I can probably reproduce the bug without the e1000 emulated network device, since Xen guests use the Xen PV network devices instead. The important command line switch that causes this bug is -machine xenfv,igd-passthru=on, together with the fact that, using the Xen xl toolstack, 3 PCI devices are passed through to the Xen HVM guest that uses Qemu as the device model running in Xen's Dom0. The problem lies in Qemu's setup of the passed through PCI devices.

Emulated/Virtualized environment

Operating system: Debian 11.3
OS/kernel version: Linux bullseye 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Architecture: x86_64, Xen HVM guest with Qemu device model and PCI passthrough

Description of problem

In brief, PCI/GPU passthrough functions normally with Xen/Qemu if the Xen HVM guest is Windows, but if the guest is Linux, the guest does not complete the boot process: it never reaches the systemd targets that start the GUI environment and allow login to the desktop. The cause is a bug in the way Qemu initializes the PCI status register of the passed through devices, which leaves the PCI capabilities list bit of the PCI status register disabled instead of enabled. This in turn disables the MSI-X interrupt handling capability of the passed through PCI devices. I think only Linux guests are affected because Linux and Windows guests use different methods of delivering interrupts from the passed through PCI devices to the guest: the method used by Windows does not require the MSI-X capability of the devices, but the method used by Linux does. I will explain this further in the additional information section with logs and other relevant information.
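
To illustrate why a cleared capabilities list bit can hide a capability such as MSI-X from a guest, here is a minimal sketch of the usual capability-list walk over PCI configuration space. It is not taken from Qemu or from any guest kernel: the helper name `find_capability` and the flat 256-byte `cfg` snapshot are hypothetical, and only the register offsets and IDs are the standard ones from pci/header.h.

```c
#include <stdint.h>

/* Standard offsets and IDs, as defined in pci/header.h. */
#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10
#define PCI_CAPABILITY_LIST   0x34
#define PCI_CAP_ID_MSIX       0x11

/*
 * Hypothetical helper: search a 256-byte snapshot of PCI config space
 * for capability `cap_id`. Returns its offset, or 0 if it is not found.
 */
static uint8_t find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
    uint16_t status = (uint16_t)(cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8));
    uint8_t pos;
    int ttl = 48; /* guard against malformed, looping lists */

    /* If the status register does not advertise a list, stop here. */
    if (!(status & PCI_STATUS_CAP_LIST)) {
        return 0;
    }

    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC; /* pointers are dword-aligned */
    while (pos >= 0x40 && ttl-- > 0) {
        if (cfg[pos] == cap_id) {
            return pos;
        }
        pos = cfg[pos + 1] & 0xFC; /* follow the "next capability" pointer */
    }
    return 0;
}
```

With a walk of this shape, the list entries themselves can be perfectly intact in config space, yet the search returns 0 as soon as the PCI_STATUS_CAP_LIST bit reads as 0.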

Steps to reproduce

It might only be reproducible on specific hardware. It is very reproducible on my system, an ASRock B85M-Pro4 with BIOS version P2.50 and a Haswell core i5-4590S CPU.
Configure the system to pass through the Intel integrated graphics device (IGD), the on-board USB 3 controller, and the onboard PCI audio device to a Windows Xen HVM guest with Qemu running as the device model for the Windows guest in Dom0 using the Xen xl toolstack, and verify that the Windows guest boots and functions properly. This is not trivial and can probably only be done by persons familiar with Xen and its PCI and VGA/GPU passthrough feature. Here is the xl domain configuration file that the Xen xl toolstack used to create and boot the working Windows HVM domain with passthrough of three PCI devices on my hardware:

builder = 'hvm'
bios = 'seabios'
memory = '3072'
vcpus = '4'
device_model_version = 'qemu-xen'
disk = ['/dev/systems/windows,,xvda,w']
name = 'bullseye'
vif = [ 'model=e1000,script=vif-route,ip=<redacted>' ]
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'destroy'
boot = 'c'
acpi = '1'
apic = '1'
viridian = '1'
xen_platform_pci = '1'
serial = 'pty'
vga = 'none'
sdl = '0'
vnc = '0'
gfx_passthru = '1'
pci = [ '00:1b.0', '00:14.0,rdm_policy=relaxed', '00:02.0' ]

Shut down the working Windows Xen HVM and replace it with a Linux Xen HVM disk image and try to boot that in place of Windows, keeping all other configuration options the same as with the working Windows guest. To create and boot the non-working Linux HVM domain, I used the same xl domain configuration as for Windows with the exception that the disk line was replaced with:

disk = ['/dev/systems/linux,,xvda,w']

which obviously points to a virtual disk that boots Linux instead of Windows. A Linux guest, such as Debian bullseye, Debian buster, or Debian sid, will not boot properly and will instead exhibit the problem handling IRQs from the passed through PCI devices, as discussed above.

Additional information

This problem is known by QubesOS and they have been using a patch to fix it since 2017, but they give very few details about the problem in their commit messages:

https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c

That same patch to hw/xen/xen_pt_config_init.c also fixes the problem on my system.

Some logs:

Without the QubesOS patch, I get error messages indicating problems handling IRQs like this in the Dom0:

May 10 08:50:03 bullseye kernel: [79077.644346] pciback 0000:00:1b.0: xen_pciback: vpci: assign to virtual slot 0
May 10 08:50:03 bullseye kernel: [79077.644478] pciback 0000:00:1b.0: registering for 16
May 10 08:50:03 bullseye kernel: [79077.644732] pciback 0000:00:14.0: xen_pciback: vpci: assign to virtual slot 1
May 10 08:50:03 bullseye kernel: [79077.644874] pciback 0000:00:14.0: registering for 16
May 10 08:50:03 bullseye kernel: [79077.645024] pciback 0000:00:02.0: xen_pciback: vpci: assign to virtual slot 2
May 10 08:50:03 bullseye kernel: [79077.645107] pciback 0000:00:02.0: registering for 16
May 10 08:50:30 bullseye kernel: [79105.273876] vif vif-16-0 vif16.0: Guest Rx ready
May 10 08:50:30 bullseye kernel: [79105.273893] IPv6: ADDRCONF(NETDEV_CHANGE): vif16.0: link becomes ready
May 10 08:50:30 bullseye kernel: [79105.278023] xen-blkback: backend/vbd/16/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
May 10 08:50:44 bullseye kernel: [79119.104937] irq 16: nobody cared (try booting with the "irqpoll" option)
May 10 08:50:44 bullseye kernel: [79119.104973] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
May 10 08:50:44 bullseye kernel: [79119.104976] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B85M Pro4, BIOS P2.50 12/11/2015
May 10 08:50:44 bullseye kernel: [79119.104979] Call Trace:
May 10 08:50:44 bullseye kernel: [79119.104984] <IRQ>
May 10 08:50:44 bullseye kernel: [79119.104998] dump_stack+0x6b/0x83
May 10 08:50:44 bullseye kernel: [79119.105008] __report_bad_irq+0x35/0xa7
May 10 08:50:44 bullseye kernel: [79119.105014] note_interrupt.cold+0xb/0x61
May 10 08:50:44 bullseye kernel: [79119.105024] handle_irq_event+0xa8/0xb0
May 10 08:50:44 bullseye kernel: [79119.105030] handle_fasteoi_irq+0x78/0x1c0
May 10 08:50:44 bullseye kernel: [79119.105037] generic_handle_irq+0x47/0x50
May 10 08:50:44 bullseye kernel: [79119.105044] __evtchn_fifo_handle_events+0x175/0x190
May 10 08:50:44 bullseye kernel: [79119.105054] __xen_evtchn_do_upcall+0x66/0xb0
May 10 08:50:44 bullseye kernel: [79119.105063] __xen_pv_evtchn_do_upcall+0x11/0x20
May 10 08:50:44 bullseye kernel: [79119.105069] asm_call_irq_on_stack+0x12/0x20
May 10 08:50:44 bullseye kernel: [79119.105072] </IRQ>
May 10 08:50:44 bullseye kernel: [79119.105079] xen_pv_evtchn_do_upcall+0xa2/0xc0
May 10 08:50:44 bullseye kernel: [79119.105084] exc_xen_hypervisor_callback+0x8/0x10
May 10 08:50:44 bullseye kernel: [79119.105091] RIP: e030:xen_hypercall_sched_op+0xa/0x20
May 10 08:50:44 bullseye kernel: [79119.105097] Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
May 10 08:50:44 bullseye kernel: [79119.105100] RSP: e02b:ffffffff82603de8 EFLAGS: 00000246
May 10 08:50:44 bullseye kernel: [79119.105106] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810023aa
May 10 08:50:44 bullseye kernel: [79119.105108] RDX: 0000000009d62df2 RSI: 0000000000000000 RDI: 0000000000000001
May 10 08:50:44 bullseye kernel: [79119.105111] RBP: ffffffff82613940 R08: 00000066a1715350 R09: 000047f57b235dc9
May 10 08:50:44 bullseye kernel: [79119.105114] R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
May 10 08:50:44 bullseye kernel: [79119.105117] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 10 08:50:44 bullseye kernel: [79119.105124] ? xen_hypercall_sched_op+0xa/0x20
May 10 08:50:44 bullseye kernel: [79119.105133] ? xen_safe_halt+0xc/0x20
May 10 08:50:44 bullseye kernel: [79119.105140] ? default_idle+0xa/0x10
May 10 08:50:44 bullseye kernel: [79119.105145] ? default_idle_call+0x38/0xc0
May 10 08:50:44 bullseye kernel: [79119.105152] ? do_idle+0x208/0x2b0
May 10 08:50:44 bullseye kernel: [79119.105158] ? cpu_startup_entry+0x19/0x20
May 10 08:50:44 bullseye kernel: [79119.105164] ? start_kernel+0x587/0x5a8
May 10 08:50:44 bullseye kernel: [79119.105170] ? xen_start_kernel+0x625/0x631
May 10 08:50:44 bullseye kernel: [79119.105180] ? startup_xen+0x3e/0x3e
May 10 08:50:44 bullseye kernel: [79119.105184] handlers:
May 10 08:50:44 bullseye kernel: [79119.105222] [<000000005d228d5f>] usb_hcd_irq [usbcore]
May 10 08:50:44 bullseye kernel: [79119.105245] [<00000000e534b010>] ath_isr [ath9k]
May 10 08:50:44 bullseye kernel: [79119.105257] Disabling IRQ #16

Also, without the patch, I get error messages about failure to handle IRQs in the Linux Xen HVM guest:

Oct 23 18:50:32 domU kernel: irq 36: nobody cared (try booting with the "irqpoll" option)
Oct 23 18:50:32 domU kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1
Oct 23 18:50:32 domU kernel: Hardware name: Xen HVM domU, BIOS 4.14.3 10/22/2021
Oct 23 18:50:32 domU kernel: Call Trace:
Oct 23 18:50:32 domU kernel: <IRQ>
Oct 23 18:50:32 domU kernel: dump_stack+0x6b/0x83
Oct 23 18:50:32 domU kernel: __report_bad_irq+0x35/0xa7
Oct 23 18:50:32 domU kernel: note_interrupt.cold+0xb/0x61
Oct 23 18:50:32 domU kernel: handle_irq_event+0xa8/0xb0
Oct 23 18:50:32 domU kernel: handle_fasteoi_irq+0x78/0x1c0
Oct 23 18:50:32 domU kernel: generic_handle_irq+0x47/0x50
Oct 23 18:50:32 domU kernel: __evtchn_fifo_handle_events+0x175/0x190
Oct 23 18:50:32 domU kernel: __xen_evtchn_do_upcall+0x66/0xb0
Oct 23 18:50:32 domU kernel: __sysvec_xen_hvm_callback+0x22/0x30
Oct 23 18:50:32 domU kernel: asm_call_irq_on_stack+0x12/0x20
Oct 23 18:50:32 domU kernel: </IRQ>
Oct 23 18:50:32 domU kernel: sysvec_xen_hvm_callback+0x72/0x80
Oct 23 18:50:32 domU kernel: asm_sysvec_xen_hvm_callback+0x12/0x20
Oct 23 18:50:32 domU kernel: RIP: 0010:native_safe_halt+0xe/0x10
Oct 23 18:50:32 domU kernel: Code: 02 20 48 8b 00 a8 08 75 c4 e9 7b ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d a6 6f 54 00 fb f4 90 e9 07 00 00 00 0f 00 2d 96 6f 54 00 f4 c3 cc cc 0f 1f 44 00
Oct 23 18:50:32 domU kernel: RSP: 0018:ffffffff89003e48 EFLAGS: 00000246
Oct 23 18:50:32 domU kernel: RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff8dbb7cc2c9c0
Oct 23 18:50:32 domU kernel: RDX: ffff8dbb7cc00000 RSI: ffff8dbaf55b1400 RDI: ffff8dbaf55b1464
Oct 23 18:50:32 domU kernel: RBP: ffff8dbaf55b1464 R08: ffffffff891b9120 R09: 0000000000000008
Oct 23 18:50:32 domU kernel: R10: 000000000000000e R11: 000000000000000d R12: 0000000000000001
Oct 23 18:50:32 domU kernel: R13: ffffffff891b91a0 R14: 0000000000000001 R15: 0000000000000000
Oct 23 18:50:32 domU kernel: ? xen_sched_clock+0x11/0x20
Oct 23 18:50:32 domU kernel: acpi_idle_do_entry+0x46/0x50
Oct 23 18:50:32 domU kernel: acpi_idle_enter+0x86/0xc0
Oct 23 18:50:32 domU kernel: cpuidle_enter_state+0x89/0x350
Oct 23 18:50:32 domU kernel: cpuidle_enter+0x29/0x40
Oct 23 18:50:32 domU kernel: do_idle+0x1ef/0x2b0
Oct 23 18:50:32 domU kernel: cpu_startup_entry+0x19/0x20
Oct 23 18:50:32 domU kernel: start_kernel+0x587/0x5a8
Oct 23 18:50:32 domU kernel: secondary_startup_64_no_verify+0xb0/0xbb
Oct 23 18:50:32 domU kernel: handlers:
Oct 23 18:50:32 domU kernel: [<000000007d3a0964>] usb_hcd_irq [usbcore]
Oct 23 18:50:32 domU kernel: Disabling IRQ #36
Oct 23 18:50:32 domU kernel: PM: Image not found (code -22)
Oct 23 18:50:32 domU kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] ERROR [CRTC:45:pipe A] flip_done timed out

To prove the cause of the bug, I compare some logs without the patch and with the patch that fixes it.

First, relevant logs generated by Qemu in Dom0, for existing Qemu without the patch. On Debian these logs are located in /var/log/xen in the Dom0:

[00:06.0] xen_pt_realize: Assigning real physical device 00:14.0 to devfn 0x30
[...]
[00:06.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0290, syncing to 0x0280.
[...]
[00:06.0] xen_pt_realize: Real physical device 00:14.0 registered successfully
[00:02.0] xen_pt_realize: Assigning real physical device 00:02.0 to devfn 0x10
[...]
[00:02.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0090, syncing to 0x0080.
[...]
[00:02.0] xen_pt_realize: Real physical device 00:02.0 registered successfully

Next, the same logs, but now using a version of Qemu with the patch that fixes the bug:

[00:06.0] xen_pt_realize: Assigning real physical device 00:14.0 to devfn 0x30
[...]
[00:06.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0290, syncing to 0x0290.
[...]
[00:06.0] xen_pt_realize: Real physical device 00:14.0 registered successfully
[00:02.0] xen_pt_realize: Assigning real physical device 00:02.0 to devfn 0x10
[...]
[00:02.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0090, syncing to 0x0090.
[...]
[00:02.0] xen_pt_realize: Real physical device 00:02.0 registered successfully

To decipher what is happening here, one must refer to the definitions in the pci/header.h file from PCI Utilities that in Debian is in the libpci-dev package and is probably in similarly named packages on other distros.

The Offset of 0x0006 corresponds to the 16-bit PCI_STATUS register of the passed through device, and the Emulated value of 0x0010 means that the desired emulated value of the PCI_STATUS_CAP_LIST bit in the PCI_STATUS register is 1. The host values of 0x0290 and 0x0090 are the values of the register in the physical devices 00:14.0 and 00:02.0, respectively. The "syncing to" value is the value of the register that Qemu exposes to the guest. Notice that without the patch, the PCI_STATUS_CAP_LIST bit is turned off for the two PCI devices (register value = 0x0280 and 0x0080 for real device 00:14.0 and 00:02.0, respectively), but with the patch the bit is turned on (0x0290 and 0x0090). With the capabilities list enabled, the guest can use the MSI-X capability of the devices; with it disabled, the guest cannot. That explains why this patch is needed in Qemu to fix this problem and enable the Linux guest to use the MSI-X capability of the passed through PCI devices.
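
The bit arithmetic behind that reading of the logs can be checked with a few lines of C. This is only an illustrative sketch: the helper names `status_has_cap_list` and `status_changed_bits` are hypothetical, while the PCI_STATUS_CAP_LIST value is the one defined in pci/header.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_STATUS_CAP_LIST 0x10 /* "Support Capability List", from pci/header.h */

/* Hypothetical helper: does this PCI_STATUS value advertise a capability list? */
static bool status_has_cap_list(uint16_t status)
{
    return (status & PCI_STATUS_CAP_LIST) != 0;
}

/* Hypothetical helper: which bits differ between two PCI_STATUS values? */
static uint16_t status_changed_bits(uint16_t a, uint16_t b)
{
    return (uint16_t)(a ^ b);
}
```

Applied to the values in the logs, the only bit that differs between the unpatched and patched "syncing to" values (0x0280 vs 0x0290, and 0x0080 vs 0x0090) is 0x0010, which is PCI_STATUS_CAP_LIST.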

This is the QubesOS patch that fixes it:

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -1969,7 +1969,7 @@
             /* Mask out host (including past size). */
             new_val = val & host_mask;
             /* Merge emulated ones (excluding the non-emulated ones). */
-            new_val |= data & host_mask;
+            new_val |= data & reg->emu_mask;
             /* Leave intact host and emulated values past the size - even though
              * we do not care as we write per reg->size granularity, but for the
              * logging below lets have the proper value. */
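
As a rough model of what this one-line change does, the following sketch reproduces the two merge variants outside of Qemu. The function names are hypothetical stand-ins, and two details are assumptions rather than facts taken from this report: that host_mask is derived as size_mask & ~emu_mask, and that the emu_mask of the PCI_STATUS register is 0x0010. Both assumptions are consistent with the logged values.

```c
#include <stdint.h>

/*
 * val      = value read from the host device
 * data     = emulated value produced by the register's init function
 * emu_mask = bits that Qemu emulates for this register
 * size     = register size in bytes (1, 2 or 4)
 */
static uint32_t size_mask_for(unsigned size)
{
    return 0xFFFFFFFFu >> ((4 - size) << 3);
}

/* Merge as in the unpatched code: both terms use host_mask. */
static uint32_t merge_unpatched(uint32_t val, uint32_t data,
                                uint32_t emu_mask, unsigned size)
{
    uint32_t host_mask = size_mask_for(size) & ~emu_mask; /* assumed derivation */
    uint32_t new_val = val & host_mask;
    new_val |= data & host_mask; /* the emulated bits are masked away here */
    return new_val;
}

/* Merge as in the QubesOS patch: emulated bits are taken via emu_mask. */
static uint32_t merge_patched(uint32_t val, uint32_t data,
                              uint32_t emu_mask, unsigned size)
{
    uint32_t host_mask = size_mask_for(size) & ~emu_mask; /* assumed derivation */
    uint32_t new_val = val & host_mask;
    new_val |= data & emu_mask; /* the emulated bits survive the merge */
    return new_val;
}
```

Under these assumptions, calling both functions with val = 0x0290, data = 0x0010, emu_mask = 0x0010 and size = 2 gives 0x0280 for the unpatched merge and 0x0290 for the patched one, matching the "syncing to" values logged for device 00:14.0.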

The QubesOS patch that fixes it in Debian's Qemu 7.0.0 build is also attached as a file: xen-fix-emu-mask.patch

~~I will not officially submit it as a patch because I am not its author.~~

I do not know why QubesOS never officially requested that this fix be committed to Qemu upstream, but I hope after review by the maintainers of the code touched by this patch it will be recognized as a necessary fix to a mistake that causes the desired merge of the host and emulated values to be incorrect.

For reference, the commit that is fixed by the QubesOS patch is:

Fixes: 2e87512e (xen/pt: Sync up the dev.config and data values.)

I think that commit and the patched file might need some other cleanup, so I might try my hand at officially submitting a patch to Qemu that fixes this issue on my hardware without breaking something else. It is possible that the simple QubesOS patch is not suitable as the correct fix.

But before I do that, I wish to make one more comment. In my logs, the only register other than PCI_STATUS that is affected by the QubesOS patch is the PCI_HEADER_TYPE register. Without the patch, the register's value is always exposed to the guest as 0x80, and with the patch, the value is always exposed as 0x00 (PCI_HEADER_TYPE_NORMAL as defined in pci/header.h). That is because Qemu sets the initial emulated value of the PCI_HEADER_TYPE register to 0x80, but Qemu also sets the emu_mask to 0x00, so once the QubesOS patch corrects the merging of the host and emulated values, the value is synced to 0x00 instead of 0x80. What I don't understand is why the register is initialized with 0x80 while the emu_mask is 0x00. Shouldn't the emu_mask be 0x80, to pass through the initial emulated value of 0x80? ~~Also, I don't know why the initial emulated value of PCI_HEADER_TYPE is set to 0x80 but I will assume that is the correct emulated value that should be exposed to the guest.~~ Update: After doing some research, I discovered that the bit set in the PCI_HEADER_TYPE register (0x80) because of this issue is the bit that marks the device as a multifunction device. None of my devices are multifunction, and the fact that the multifunction bit is incorrectly set on my passed through devices because of this issue seems to have no effect on the operation of the devices or the guest. Apparently the author of the code that initializes the PCI_HEADER_TYPE register intended to present every passed through device as a multifunction device, but is this needed? My testing indicates it is not needed on my system.

I am referring to this code in hw/xen/xen_pt_config_init.c:

static int xen_pt_header_type_reg_init(XenPCIPassthroughState *s,
                                       XenPTRegInfo *reg, uint32_t real_offset,
                                       uint32_t *data)
{
    /* read PCI_HEADER_TYPE */
    *data = reg->init_val | 0x80;
    return 0;
}

[...]

     /* Header Type reg */
    {
        .offset     = PCI_HEADER_TYPE,
        .size       = 1,
        .init_val   = 0x00,
        .ro_mask    = 0xFF,
        .emu_mask   = 0x00,
        .init       = xen_pt_header_type_reg_init,
        .u.b.read   = xen_pt_byte_reg_read,
        .u.b.write  = xen_pt_byte_reg_write,
    },

I would appreciate any guidance that experienced Qemu or Xen contributors can give me about this question. If no one gives me any guidance here in a timely manner, I plan to propose my own fix officially to Qemu as the QubesOS patch plus changing the emu_mask value of the PCI_HEADER_TYPE register from 0x00 to 0x80. I verified that fixes the problem I am seeing in the PCI_STATUS register without also causing the change that the QubesOS patch makes to the PCI_HEADER_TYPE register.
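
The three PCI_HEADER_TYPE outcomes discussed above can be reproduced with a small model of the one-byte merge. This is a sketch under stated assumptions, not Qemu code: the function names are hypothetical, the host value of the register is assumed to be 0x00 (a non-multifunction device with a normal header), and host_mask is assumed to be the complement of emu_mask within the register's single byte.

```c
#include <stdint.h>

/* Emulated value, computed the same way as in xen_pt_header_type_reg_init(). */
static uint8_t header_type_data(uint8_t init_val)
{
    return (uint8_t)(init_val | 0x80);
}

/* Unpatched merge: host and emulated values are both filtered by host_mask. */
static uint8_t exposed_unpatched(uint8_t host_val, uint8_t data, uint8_t emu_mask)
{
    uint8_t host_mask = (uint8_t)~emu_mask; /* size is 1, so size_mask is 0xFF */
    return (uint8_t)((host_val & host_mask) | (data & host_mask));
}

/* Patched merge: emulated bits are selected with emu_mask instead. */
static uint8_t exposed_patched(uint8_t host_val, uint8_t data, uint8_t emu_mask)
{
    uint8_t host_mask = (uint8_t)~emu_mask;
    return (uint8_t)((host_val & host_mask) | (data & emu_mask));
}
```

With host_val = 0x00 and data = header_type_data(0x00), the unpatched merge with emu_mask 0x00 yields 0x80, the patched merge with emu_mask 0x00 yields 0x00, and the patched merge with emu_mask 0x80 yields 0x80 again.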

I plan to submit a patch to fix this issue, noting the effect the patch has on the PCI_HEADER_TYPE register in the commit message.

Proposed Patch

Link to patch on qemu-devel

I decided not to mention in the commit message that the PCI_HEADER_TYPE register is also changed by the patch on my system, because I think that is also a correction of a wrong value. I limited the commit message to saying that, in some cases, the register was being initialized with the wrong value. Here is the proposed patch:

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -1966,10 +1966,10 @@ static void xen_pt_config_reg_init(XenPCIPassthroughState *s,
         if ((data & host_mask) != (val & host_mask)) {
             uint32_t new_val;
 
-            /* Mask out host (including past size). */
-            new_val = val & host_mask;
-            /* Merge emulated ones (excluding the non-emulated ones). */
-            new_val |= data & host_mask;
+            /* Merge the emulated bits (data) with the host bits (val)
+             * and mask out the bits past size to enable restoration
+             * of the proper value for logging below. */
+            new_val = XEN_PT_MERGE_VALUE(val, data, host_mask) & size_mask;
             /* Leave intact host and emulated values past the size - even though
              * we do not care as we write per reg->size granularity, but for the
              * logging below lets have the proper value. */
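
For readers unfamiliar with XEN_PT_MERGE_VALUE, the sketch below shows what the proposed line computes. The macro body is reproduced here for illustration in the form it has in hw/xen/xen_pt.h; the wrapper function `merge_proposed` is hypothetical, and the derivation of host_mask as size_mask & ~emu_mask is an assumption about the surrounding code.

```c
#include <stdint.h>

/* Take the bits selected by val_mask from value, and all other bits from data. */
#define XEN_PT_MERGE_VALUE(value, data, val_mask) \
    (((value) & (val_mask)) | ((data) & ~(val_mask)))

/*
 * Hypothetical wrapper around the proposed merge.
 * val = host value, data = emulated value, size = register size in bytes.
 */
static uint32_t merge_proposed(uint32_t val, uint32_t data,
                               uint32_t emu_mask, unsigned size)
{
    uint32_t size_mask = 0xFFFFFFFFu >> ((4 - size) << 3);
    uint32_t host_mask = size_mask & ~emu_mask; /* assumed derivation */

    /* Host bits come from val, emulated bits from data, nothing past size. */
    return XEN_PT_MERGE_VALUE(val, data, host_mask) & size_mask;
}
```

Within the register size this reduces to (val & host_mask) | (data & emu_mask), which is the same result as the QubesOS one-line change whenever emu_mask does not extend past the register size. For the logged PCI_STATUS case (val = 0x0290, data = 0x0010, emu_mask assumed to be 0x0010, size = 2) it yields 0x0290.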

I propose the labels accel:Xen, target:i386, and os:Linux for this issue.

Edited Jun 14, 2022 by Chuck Zmudzinski