Nested vIOMMU PCI Passthrough kernel panics
┌──────────────────────────────┐
│ │
│ L2 Guest (deb12) │ Libvirt / Qemu / Kernel / Machine
│ │ NA / NA / 6.1.0 / pc-q35-8.2
├──────────────────────────────┤
├──────────────────────────────┤
│ │
│ L1 Hypervisor (manjaro) │ Libvirt / Qemu / Kernel / Machine
│ │ 9.10.0 / 8.2.0 / 6.6.10 / pc-q35-8.2
├──────────────────────────────┤
├──────────────────────────────┤
│ │
│ L0 Hypervisor (manjaro) │ Libvirt / Qemu / Kernel / Machine
│ │ 9.10.0 / 8.2.0 / 6.5.13 / physical
└──────────────────────────────┘
Host environment (Please see ASCII diagram)
- Operating system: L0 Manjaro, L1 Manjaro, L2 Debian 12
- OS/kernel version: L0 6.5.13, L1 6.6.10, L2 6.1.0
- Architecture: x86_64 on all machines
- QEMU flavor: qemu-system-x86_64
- QEMU version: L0 8.2.0, L1 8.2.0, L2 NA
- QEMU command line: taken directly from the QEMU wiki's VT-d page, with monitor and serial console options added:
sudo qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
-device intel-iommu,intremap=on,caching-mode=on \
-serial telnet:localhost:4321,server,nowait \
-monitor telnet:127.0.0.1:1234,server,nowait \
-device vfio-pci,host=08:00.0 \
$IMAGE_PATH
Emulated/Virtualized environment (Please see the diagram)
- Operating system: L1 Manjaro, L2 Debian
- OS/kernel version: L1 6.6.10, L2 6.1.0
- Architecture: x86_64
Description of problem
While testing vIOMMU according to https://wiki.qemu.org/Features/VT-d I ran into a kernel panic on an L2 guest that receives the L1 hypervisor's virtual macvtap network device as a PCI passthrough hostdev. Upon an ifup inside the L2 guest, on the network device passed through from the L1 host, the following kernel panic occurs and the L2 guest reboots:
[ OK ] Started ifup@enp0s2.service - ifup for enp0s2.
[ OK ] Started ifup@enp0s3.service - ifup for enp0s3.[ 24.019839] audit: type=1400 audit(1707113302.472:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=457 comm="apparmor_parser"
Starting networking.service - Raise network interfaces...
[ 24.255671] audit: type=1400 audit(1707113302.472:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=457 comm="apparmor_parser"
[ OK ] Finished systemd-tmpfiles-…te Volatile Files and Directories.
[ 24.361355] audit: type=1400 audit(1707113302.472:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=457 comm="apparmor_parser"
Starting systemd-timesyncd… - Network Time Synchronization...
Starting systemd-update-ut…rd System Boot/Shutdown in UTMP...
[ OK ] Finished systemd-update-ut…cord System Boot/Shutdown in UTMP.
[ OK ] Finished networking.service - Raise network interfaces.
[ OK ] Reached target network.target - Network.
[ OK ] Started systemd-timesyncd.…0m - Network Time Synchronization.
[ OK ] Reached target sysinit.target - System Initialization.
[ OK ] Started etckeeper.timermit of changes in /etc directory.
[ OK ] Started systemd-tmpfiles-c… Cleanup of Temporary Directories.
[ OK ] Reached target time-set.target - System Time Set.
[ OK ] Started apt-daily.timer - Daily apt download activities.[ 46.187450] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 46.187522] rcu: 0-...!: (5250 ticks this GP) idle=3774/1/0x4000000000000000 softirq=12350/12350 fqs=0
[ 46.187522] (t=5250 jiffies g=8669 q=7 ncpus=1)
[ 46.187522] rcu: rcu_preempt kthread timer wakeup didn't happen for 5249 jiffies! g8669 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 46.187522] rcu: Possible timer handling issue on cpu=0 timer-softirq=2282
[ 46.187522] rcu: rcu_preempt kthread starved for 5250 jiffies! g8669 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 46.187522] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 46.187522] rcu: RCU grace-period kthread stack dump:
[ 46.187522] task:rcu_preempt state:I stack:0 pid:15 ppid:2 flags:0x00004000
[ 46.187522] Call Trace:
[ 46.187522] <TASK>
[ 46.187522] __schedule+0x34d/0x9e0
[ 46.187522] ? rcu_gp_cleanup+0x460/0x460
[ 46.187522] schedule+0x5a/0xd0
[ 46.187522] schedule_timeout+0x94/0x150
[ 46.187522] ? __bpf_trace_tick_stop+0x10/0x10
[ 46.187522] rcu_gp_fqs_loop+0x141/0x550
[ 46.187522] rcu_gp_kthread+0xd0/0x190
[ 46.187522] kthread+0xda/0x100
[ 46.187522] ? kthread_complete_and_exit+0x20/0x20
[ 46.187522] ret_from_fork+0x22/0x30
[ 46.187522] </TASK>
[ 46.187522] rcu: Stack dump where RCU GP kthread last ran:
[ 46.187522] CPU: 0 PID: 487 Comm: ip Not tainted 6.1.0-17-amd64 #1 Debian 6.1.69-1
[ 46.187522] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 46.187522] RIP: 0010:virtqueue_get_buf_ctx_split+0x94/0xd0 [virtio_ring]
[ 46.187522] Code: 42 fe ff ff 0f b7 43 58 83 c0 01 66 89 43 58 f6 83 80 00 00 00 01 75 12 80 7b 4a 00 48 8b 4b 70 8b 53 60 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[ 46.187522] RSP: 0018:ffff960c408135c8 EFLAGS: 00000246
[ 46.187522] RAX: 0000000000000000 RBX: ffff88e04e976100 RCX: 0000000000000001
[ 46.187522] RDX: 0000000000000000 RSI: ffff960c408135e4 RDI: ffff88e04e976100
[ 46.187522] RBP: 0000000000000000 R08: 0000000000000004 R09: ffff88e0034fa980
[ 46.187522] R10: 0000000000000003 R11: ffff960c40813628 R12: 0000000000000002
[ 46.187522] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 46.187522] FS: 00007f11d16da2c0(0000) GS:ffff88e07dc00000(0000) knlGS:0000000000000000
[ 46.187522] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.187522] CR2: 00007f11d17ff8d0 CR3: 0000000004ac6000 CR4: 00000000000006f0
[ 46.187522] Call Trace:
[ 46.187522] <IRQ>
[ 46.187522] ? rcu_check_gp_kthread_starvation+0xec/0xfd
[ 46.187522] ? rcu_sched_clock_irq.cold+0xe3/0x459
[ 46.187522] ? update_load_avg+0x7e/0x780
[ 46.187522] ? sched_slice+0x87/0x140
[ 46.187522] ? timekeeping_update+0xdd/0x130
[ 46.187522] ? timekeeping_advance+0x377/0x570
[ 46.187522] ? update_process_times+0x70/0xb0
[ 46.187522] ? tick_sched_handle+0x22/0x60
[ 46.187522] ? tick_sched_timer+0x63/0x80
[ 46.187522] ? tick_sched_do_timer+0xa0/0xa0
[ 46.187522] ? __hrtimer_run_queues+0x112/0x2b0
[ 46.187522] ? hrtimer_interrupt+0xf4/0x210
[ 46.187522] ? __sysvec_apic_timer_interrupt+0x5d/0x110
[ 46.187522] ? sysvec_apic_timer_interrupt+0x69/0x90
[ 46.187522] </IRQ>
[ 46.187522] <TASK>
[ 46.187522] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 46.187522] ? virtqueue_get_buf_ctx_split+0x94/0xd0 [virtio_ring]
[ 46.187522] virtnet_send_command+0x18e/0x1e0 [virtio_net]
[ 46.187522] virtnet_set_rx_mode+0xd4/0x2d0 [virtio_net]
[ 46.187522] __dev_open+0x12b/0x1a0
[ 46.187522] __dev_change_flags+0x1d2/0x240
[ 46.187522] dev_change_flags+0x22/0x60
[ 46.187522] do_setlink+0x37c/0x12b0
[ 46.187522] ? __nla_validate_parse+0x61/0xc00
[ 46.187522] __rtnl_newlink+0x623/0x9e0
[ 46.187522] ? __kmem_cache_alloc_node+0x191/0x2a0
[ 46.187522] rtnl_newlink+0x43/0x70
[ 46.187522] rtnetlink_rcv_msg+0x14e/0x3b0
[ 46.187522] ? __kmem_cache_alloc_node+0x191/0x2a0
[ 46.187522] ? __alloc_skb+0x88/0x1a0
[ 46.187522] ? rtnl_calcit.isra.0+0x140/0x140
[ 46.187522] netlink_rcv_skb+0x51/0x100
[ 46.187522] netlink_unicast+0x24a/0x390
[ 46.187522] netlink_sendmsg+0x250/0x4c0
[ 46.187522] __sock_sendmsg+0x5f/0x70
[ 46.187522] ____sys_sendmsg+0x277/0x2f0
[ 46.187522] ? copy_msghdr_from_user+0x7d/0xc0
[ 46.187522] ___sys_sendmsg+0x9a/0xe0
[ 46.187522] __sys_sendmsg+0x76/0xc0
[ 46.187522] do_syscall_64+0x5b/0xc0
[ 46.187522] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 46.187522] ? syscall_exit_to_user_mode+0x27/0x40
[ 46.187522] ? do_syscall_64+0x67/0xc0
[ 46.187522] ? do_user_addr_fault+0x1b0/0x580
[ 46.187522] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 46.187522] entry_SYSCALL_64_after_hwframe+0x64/0xce
[ 46.187522] RIP: 0033:0x7f11d1811af0
[ 46.187522] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d f1 fa 0c 00 00 74 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
[ 46.187522] RSP: 002b:00007ffe21b533a8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 46.187522] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f11d1811af0
[ 46.187522] RDX: 0000000000000000 RSI: 00007ffe21b53410 RDI: 0000000000000003
[ 46.187522] RBP: 0000000000000003 R08: 0000000065c07b57 R09: 00005580e154e2a0
[ 46.187522] R10: 00007ffe21b52e34 R11: 0000000000000202 R12: 0000000065c07b58
[ 46.187522] R13: 00005580e016e020 R14: 0000000000000001 R15: 0000000000000000
[ 46.187522] </TASK>
[ 46.187522] CPU: 0 PID: 487 Comm: ip Not tainted 6.1.0-17-amd64 #1 Debian 6.1.69-1
[ 46.187522] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 46.187522] RIP: 0010:virtqueue_get_buf_ctx_split+0x94/0xd0 [virtio_ring]
[ 46.187522] Code: 42 fe ff ff 0f b7 43 58 83 c0 01 66 89 43 58 f6 83 80 00 00 00 01 75 12 80 7b 4a 00 48 8b 4b 70 8b 53 60 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[ 46.187522] RSP: 0018:ffff960c408135c8 EFLAGS: 00000246
[ 46.187522] RAX: 0000000000000000 RBX: ffff88e04e976100 RCX: 0000000000000001
[ 46.187522] RDX: 0000000000000000 RSI: ffff960c408135e4 RDI: ffff88e04e976100
[ 46.187522] RBP: 0000000000000000 R08: 0000000000000004 R09: ffff88e0034fa980
[ 46.187522] R10: 0000000000000003 R11: ffff960c40813628 R12: 0000000000000002
[ 46.187522] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 46.187522] FS: 00007f11d16da2c0(0000) GS:ffff88e07dc00000(0000) knlGS:0000000000000000
[ 46.187522] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.187522] CR2: 00007f11d17ff8d0 CR3: 0000000004ac6000 CR4: 00000000000006f0
[ 46.187522] Call Trace:
[ 46.187522] <IRQ>
[ 46.187522] ? rcu_dump_cpu_stacks+0xa4/0xe0
[ 46.187522] ? rcu_sched_clock_irq.cold+0xe8/0x459
[ 46.187522] ? update_load_avg+0x7e/0x780
[ 46.187522] ? sched_slice+0x87/0x140
[ 46.187522] ? timekeeping_update+0xdd/0x130
[ 46.187522] ? timekeeping_advance+0x377/0x570
[ 46.187522] ? update_process_times+0x70/0xb0
[ 46.187522] ? tick_sched_handle+0x22/0x60
[ 46.187522] ? tick_sched_timer+0x63/0x80
[ 46.187522] ? tick_sched_do_timer+0xa0/0xa0
[ 46.187522] ? __hrtimer_run_queues+0x112/0x2b0
[ 46.187522] ? hrtimer_interrupt+0xf4/0x210
[ 46.187522] ? __sysvec_apic_timer_interrupt+0x5d/0x110
[ 46.187522] ? sysvec_apic_timer_interrupt+0x69/0x90
[ 46.187522] </IRQ>
[ 46.187522] <TASK>
[ 46.187522] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 46.187522] ? virtqueue_get_buf_ctx_split+0x94/0xd0 [virtio_ring]
[ 46.187522] virtnet_send_command+0x18e/0x1e0 [virtio_net]
[ 46.187522] virtnet_set_rx_mode+0xd4/0x2d0 [virtio_net]
[ 46.187522] __dev_open+0x12b/0x1a0
[ 46.187522] __dev_change_flags+0x1d2/0x240
[ 46.187522] dev_change_flags+0x22/0x60
[ 46.187522] do_setlink+0x37c/0x12b0
[ 46.187522] ? __nla_validate_parse+0x61/0xc00
[ 46.187522] __rtnl_newlink+0x623/0x9e0
[ 46.187522] ? __kmem_cache_alloc_node+0x191/0x2a0
[ 46.187522] rtnl_newlink+0x43/0x70
[ 46.187522] rtnetlink_rcv_msg+0x14e/0x3b0
[ 46.187522] ? __kmem_cache_alloc_node+0x191/0x2a0
[ 46.187522] ? __alloc_skb+0x88/0x1a0
[ 46.187522] ? rtnl_calcit.isra.0+0x140/0x140
[ 46.187522] netlink_rcv_skb+0x51/0x100
[ 46.187522] netlink_unicast+0x24a/0x390
[ 46.187522] netlink_sendmsg+0x250/0x4c0
[ 46.187522] __sock_sendmsg+0x5f/0x70
[ 46.187522] ____sys_sendmsg+0x277/0x2f0
[ 46.187522] ? copy_msghdr_from_user+0x7d/0xc0
[ 46.187522] ___sys_sendmsg+0x9a/0xe0
[ 46.187522] __sys_sendmsg+0x76/0xc0
[ 46.187522] do_syscall_64+0x5b/0xc0
[ 46.187522] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 46.187522] ? syscall_exit_to_user_mode+0x27/0x40
[ 46.187522] ? do_syscall_64+0x67/0xc0
[ 46.187522] ? do_user_addr_fault+0x1b0/0x580
[ 46.187522] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 46.187522] entry_SYSCALL_64_after_hwframe+0x64/0xce
[ 46.187522] RIP: 0033:0x7f11d1811af0
[ 46.187522] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d f1 fa 0c 00 00 74 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
[ 46.187522] RSP: 002b:00007ffe21b533a8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 46.187522] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f11d1811af0
[ 46.187522] RDX: 0000000000000000 RSI: 00007ffe21b53410 RDI: 0000000000000003
[ 46.187522] RBP: 0000000000000003 R08: 0000000065c07b57 R09: 00005580e154e2a0
[ 46.187522] R10: 00007ffe21b52e34 R11: 0000000000000202 R12: 0000000065c07b58
[ 46.187522] R13: 00005580e016e020 R14: 0000000000000001 R15: 0000000000000000
[ 46.187522] </TASK>
Steps to reproduce
- Create the nested passthrough configuration described under Additional information
- Attempt to configure the L1 network hostdev interface inside the L2 guest
Any attempt causes the kernel panics documented above.
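To make reproduction attempts easier to compare, a saved copy of the L2 serial console output can be scanned for the stall signature that appears in the logs in this report. The following is only a minimal sketch: the function name and the log file name are hypothetical, and it assumes the console output was saved to a file (for example by piping the telnet session on port 4321 through tee).

```shell
#!/bin/sh
# Count RCU stall reports in a saved console or dmesg capture.
# The search string is copied verbatim from the log lines in this report.
count_rcu_stalls() {
    # $1: path to the saved capture (hypothetical file name chosen by the user)
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'rcu_preempt self-detected stall on CPU' "$1"
}
```

Usage, assuming the capture was saved as l2-console.log: `count_rcu_stalls l2-console.log`. A non-zero count indicates that at least one stall was reported during that boot.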
Additional information
L2 Nested Virtual Guest Machine
CPU: single core QEMU Virtual version 2.5+ speed: 2700 MHz
Kernel: 6.1.0-17-amd64 x86_64 Up: 2m Mem: 263.1/1967.2 MiB (13.4%)
Storage: 20 GiB (11.5% used) Procs: 111 Shell: Bash inxi: 3.3.26
This machine is launched with the following qemu command line taken from the Qemu Wiki's VT-d feature page. A couple of options (serial console and qemu monitor) were added to debug and capture information:
sudo qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
-device intel-iommu,intremap=on,caching-mode=on \
-serial telnet:localhost:4321,server,nowait \
-monitor telnet:127.0.0.1:1234,server,nowait \
-device vfio-pci,host=08:00.0 \
$IMAGE_PATH
NOTE: The kernel panic was already occurring before the troubleshooting CLI options were added.
As a control test, and to capture more information, I removed the /etc/network/interfaces.d configuration file for enp0s3 (the macvtap hostdev passed through from L1) in the L2 guest. The passthrough network interface is then not brought up, yet the passthrough itself still seems to occur. The L2 guest boots to the console prompt, and here is what I see from various sources:
lspci -nnk -d 1af4:1041
00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
Subsystem: Red Hat, Inc. Virtio 1.0 network device [1af4:1100]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci
# -------------------------------------------------------------------------
ip a s enp0s3
2: enp0s3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 52:54:00:76:26:c2 brd ff:ff:ff:ff:ff:ff
# -------------------------------------------------------------------------
dmesg | grep -E '(1af4:1041|pci|PCI)'
[ 0.078897] [mem 0xc0000000-0xfed1bfff] available for PCI devices
[ 2.963689] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[ 2.972420] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xb0000000-0xbfffffff] (base 0xb0000000)
[ 2.979385] PCI: MMCONFIG at [mem 0xb0000000-0xbfffffff] reserved in E820
[ 2.987428] PCI: Using configuration type 1 for base access
[ 3.147465] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[ 3.155367] PCI: Using E820 reservations for host bridge windows
[ 3.175112] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[ 3.187468] acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug LTR]
[ 3.195465] acpi PNP0A08:00: _OSC: OS now controls [SHPCHotplug PME AER PCIeCapability]
[ 3.205666] PCI host bridge to bus 0000:00
[ 3.210289] pci_bus 0000:00: root bus resource [io 0x0000-0x0cf7 window]
[ 3.215381] pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
[ 3.223372] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[ 3.235370] pci_bus 0000:00: root bus resource [mem 0x80000000-0xafffffff window]
[ 3.243380] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
[ 3.251374] pci_bus 0000:00: root bus resource [mem 0x100000000-0x8ffffffff window]
[ 3.259380] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 3.268046] pci 0000:00:00.0: [8086:29c0] type 00 class 0x060000
[ 3.281413] pci 0000:00:01.0: [1234:1111] type 00 class 0x030000
[ 3.299526] pci 0000:00:01.0: reg 0x10: [mem 0xfd000000-0xfdffffff pref]
[ 3.331551] pci 0000:00:01.0: reg 0x18: [mem 0xfebd4000-0xfebd4fff]
[ 3.377967] pci 0000:00:01.0: reg 0x30: [mem 0xfebc0000-0xfebcffff pref]
[ 3.387534] pci 0000:00:01.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 3.395377] pci 0000:00:01.0: pci_fixup_video+0x0/0xe0 took 11718 usecs
[ 3.413407] pci 0000:00:02.0: [8086:10d3] type 00 class 0x020000
[ 3.431371] pci 0000:00:02.0: reg 0x10: [mem 0xfeb80000-0xfeb9ffff]
[ 3.447375] pci 0000:00:02.0: reg 0x14: [mem 0xfeba0000-0xfebbffff]
[ 3.463375] pci 0000:00:02.0: reg 0x18: [io 0xc040-0xc05f]
[ 3.482783] pci 0000:00:02.0: reg 0x1c: [mem 0xfebd0000-0xfebd3fff]
[ 3.507372] pci 0000:00:02.0: reg 0x30: [mem 0xfeb00000-0xfeb3ffff pref]
[ 3.532741] pci 0000:00:03.0: [1af4:1041] type 00 class 0x020000
[ 3.567530] pci 0000:00:03.0: reg 0x14: [mem 0xfebd5000-0xfebd5fff]
[ 3.611577] pci 0000:00:03.0: reg 0x20: [mem 0xfe000000-0xfe003fff 64bit pref]
[ 3.627533] pci 0000:00:03.0: reg 0x30: [mem 0xfeb40000-0xfeb7ffff pref]
[ 3.664363] pci 0000:00:1f.0: [8086:2918] type 00 class 0x060100
[ 3.683614] pci 0000:00:1f.0: quirk: [io 0x0600-0x067f] claimed by ICH6 ACPI/GPIO/TCO
[ 3.696858] pci 0000:00:1f.2: [8086:2922] type 00 class 0x010601
[ 3.731379] pci 0000:00:1f.2: reg 0x20: [io 0xc060-0xc07f]
[ 3.745644] pci 0000:00:1f.2: reg 0x24: [mem 0xfebd6000-0xfebd6fff]
[ 3.769352] pci 0000:00:1f.3: [8086:2930] type 00 class 0x0c0500
[ 3.785828] pci 0000:00:1f.3: reg 0x20: [io 0x0700-0x073f]
[ 3.799205] ACPI: PCI: Interrupt link LNKA configured for IRQ 10
[ 3.804526] ACPI: PCI: Interrupt link LNKB configured for IRQ 10
[ 3.815394] ACPI: PCI: Interrupt link LNKC configured for IRQ 11
[ 3.826133] ACPI: PCI: Interrupt link LNKD configured for IRQ 11
[ 3.832125] ACPI: PCI: Interrupt link LNKE configured for IRQ 10
[ 3.843767] ACPI: PCI: Interrupt link LNKF configured for IRQ 10
[ 3.851918] ACPI: PCI: Interrupt link LNKG configured for IRQ 11
[ 3.859954] ACPI: PCI: Interrupt link LNKH configured for IRQ 11
[ 3.868418] ACPI: PCI: Interrupt link GSIA configured for IRQ 16
[ 3.875399] ACPI: PCI: Interrupt link GSIB configured for IRQ 17
[ 3.883392] ACPI: PCI: Interrupt link GSIC configured for IRQ 18
[ 3.891410] ACPI: PCI: Interrupt link GSID configured for IRQ 19
[ 3.903400] ACPI: PCI: Interrupt link GSIE configured for IRQ 20
[ 3.911400] ACPI: PCI: Interrupt link GSIF configured for IRQ 21
[ 3.919409] ACPI: PCI: Interrupt link GSIG configured for IRQ 22
[ 3.927391] ACPI: PCI: Interrupt link GSIH configured for IRQ 23
[ 4.019382] PCI: Using ACPI for IRQ routing
[ 5.408362] PCI: pci_cache_line_size set to 64 bytes
[ 5.412850] pci 0000:00:01.0: vgaarb: setting as boot VGA device
[ 5.415363] pci 0000:00:01.0: vgaarb: bridge control possible
[ 5.415363] pci 0000:00:01.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[ 5.929645] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
[ 5.964276] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
[ 5.988468] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[ 6.028377] pci_bus 0000:00: resource 7 [mem 0x80000000-0xafffffff window]
[ 6.067298] pci_bus 0000:00: resource 8 [mem 0xc0000000-0xfebfffff window]
[ 6.099053] pci_bus 0000:00: resource 9 [mem 0x100000000-0x8ffffffff window]
[ 6.144239] PCI: CLS 0 bytes, default 64
[ 6.456800] pci 0000:00:00.0: Adding to iommu group 0
[ 6.484962] pci 0000:00:01.0: Adding to iommu group 1
[ 6.516735] pci 0000:00:02.0: Adding to iommu group 2
[ 6.552837] pci 0000:00:03.0: Adding to iommu group 3
[ 6.604983] pci 0000:00:1f.0: Adding to iommu group 4
[ 6.644925] pci 0000:00:1f.2: Adding to iommu group 4
[ 6.688841] pci 0000:00:1f.3: Adding to iommu group 4
[ 7.882058] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 9.913524] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[ 10.843365] e1000e 0000:00:02.0 eth0: (PCI Express:2.5GT/s:Width x1) 52:54:00:12:34:56
So the L1 host's network device at PCI location 08:00.0 is successfully passed through to the L2 guest and appears at PCI location 00:03.0. Without the passthrough option on the Qemu command line this device does not exist. The ifup panic was reported in virtqueue_get_buf_ctx_split. Attempting to configure the interface with ifconfig causes another panic, this time reported in virtnet_send_command:
root@deb12:~# ifconfig enp0s3
enp0s3: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 52:54:00:76:26:c2 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
root@deb12:~# ifconfig enp0s3 198.18.1.125
[ 1514.564631] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 1514.564652] rcu: 0-...!: (5250 ticks this GP) idle=cfa4/1/0x4000000000000000 softirq=16110/16110 fqs=0
[ 1514.564652] (t=5250 jiffies g=13409 q=63 ncpus=1)
[ 1514.564652] rcu: rcu_preempt kthread timer wakeup didn't happen for 5249 jiffies! g13409 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1514.564652] rcu: Possible timer handling issue on cpu=0 timer-softirq=20150
[ 1514.564652] rcu: rcu_preempt kthread starved for 5250 jiffies! g13409 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 1514.564652] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1514.564652] rcu: RCU grace-period kthread stack dump:
[ 1514.564652] task:rcu_preempt state:I stack:0 pid:15 ppid:2 flags:0x00004000
[ 1514.564652] Call Trace:
[ 1514.564652] <TASK>
[ 1514.564652] __schedule+0x34d/0x9e0
[ 1514.564652] ? rcu_gp_cleanup+0x460/0x460
[ 1514.564652] schedule+0x5a/0xd0
[ 1514.564652] schedule_timeout+0x94/0x150
[ 1514.564652] ? __bpf_trace_tick_stop+0x10/0x10
[ 1514.564652] rcu_gp_fqs_loop+0x141/0x550
[ 1514.564652] rcu_gp_kthread+0xd0/0x190
[ 1514.564652] kthread+0xda/0x100
[ 1514.564652] ? kthread_complete_and_exit+0x20/0x20
[ 1514.564652] ret_from_fork+0x22/0x30
[ 1514.564652] </TASK>
[ 1514.564652] rcu: Stack dump where RCU GP kthread last ran:
[ 1514.564652] CPU: 0 PID: 706 Comm: ifconfig Not tainted 6.1.0-17-amd64 #1 Debian 6.1.69-1
[ 1514.564652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 1514.564652] RIP: 0010:virtnet_send_command+0x180/0x1e0 [virtio_net]
[ 1514.564652] Code: b6 a8 e4 ff 85 c0 0f 88 9a 77 00 00 48 8b 7b 08 e8 15 8d e4 ff 84 c0 75 11 eb 53 48 8b 7b 08 e8 d6 79 e4 ff 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d 74 24 04 e8 d2 87 e4 ff 48 85 c0 74 de 48 8b 83
[ 1514.564652] RSP: 0018:ffffabb68208fa20 EFLAGS: 00000246
[ 1514.564652] RAX: 0000000000000000 RBX: ffff955483bda980 RCX: 0000000000000001
[ 1514.564652] RDX: 0000000000000000 RSI: ffffabb68208fa24 RDI: ffff9554c7414e00
[ 1514.564652] RBP: ffffabb68208fa48 R08: 0000000000000004 R09: ffff955483bda980
[ 1514.564652] R10: 0000000000000003 R11: ffffabb68208fa68 R12: 0000000000000002
[ 1514.564652] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 1514.564652] FS: 00007febafdb6740(0000) GS:ffff9554fdc00000(0000) knlGS:0000000000000000
[ 1514.564652] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1514.564652] CR2: 00007febafee0f50 CR3: 0000000047f84000 CR4: 00000000000006f0
[ 1514.564652] Call Trace:
[ 1514.564652] <IRQ>
[ 1514.564652] ? rcu_check_gp_kthread_starvation+0xec/0xfd
[ 1514.564652] ? rcu_sched_clock_irq.cold+0xe3/0x459
[ 1514.564652] ? update_load_avg+0x7e/0x780
[ 1514.564652] ? sched_slice+0x87/0x140
[ 1514.564652] ? timekeeping_update+0xdd/0x130
[ 1514.564652] ? timekeeping_advance+0x377/0x570
[ 1514.564652] ? update_process_times+0x70/0xb0
[ 1514.564652] ? tick_sched_handle+0x22/0x60
[ 1514.564652] ? tick_sched_timer+0x63/0x80
[ 1514.564652] ? tick_sched_do_timer+0xa0/0xa0
[ 1514.564652] ? __hrtimer_run_queues+0x112/0x2b0
[ 1514.564652] ? hrtimer_interrupt+0xf4/0x210
[ 1514.564652] ? __sysvec_apic_timer_interrupt+0x5d/0x110
[ 1514.564652] ? sysvec_apic_timer_interrupt+0x69/0x90
[ 1514.564652] </IRQ>
[ 1514.564652] <TASK>
[ 1514.564652] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1514.564652] ? virtnet_send_command+0x180/0x1e0 [virtio_net]
[ 1514.564652] virtnet_set_rx_mode+0xd4/0x2d0 [virtio_net]
[ 1514.564652] __dev_open+0x12b/0x1a0
[ 1514.564652] __dev_change_flags+0x1d2/0x240
[ 1514.564652] ? fib_inetaddr_event+0x85/0xd0
[ 1514.564652] dev_change_flags+0x22/0x60
[ 1514.564652] devinet_ioctl+0x396/0x7c0
[ 1514.564652] inet_ioctl+0x1ae/0x1e0
[ 1514.564652] sock_do_ioctl+0x7e/0x120
[ 1514.564652] sock_ioctl+0xed/0x330
[ 1514.564652] ? _copy_to_user+0x21/0x30
[ 1514.564652] ? put_user_ifreq+0x5f/0x70
[ 1514.564652] __x64_sys_ioctl+0x90/0xd0
[ 1514.564652] do_syscall_64+0x5b/0xc0
[ 1514.564652] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 1514.564652] ? syscall_exit_to_user_mode+0x27/0x40
[ 1514.564652] ? do_syscall_64+0x67/0xc0
[ 1514.564652] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 1514.564652] entry_SYSCALL_64_after_hwframe+0x64/0xce
[ 1514.564652] RIP: 0033:0x7febafeb6c5b
[ 1514.564652] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1514.564652] RSP: 002b:00007ffe9b493b80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1514.564652] RAX: ffffffffffffffda RBX: 00007ffe9b493c80 RCX: 00007febafeb6c5b
[ 1514.564652] RDX: 00007ffe9b493be0 RSI: 0000000000008914 RDI: 0000000000000004
[ 1514.564652] RBP: 00007ffe9b493be0 R08: 0000000000000008 R09: 0000000000000000
[ 1514.564652] R10: 00007febafdd0358 R11: 0000000000000246 R12: 0000000000000041
[ 1514.564652] R13: 00007ffe9b493c80 R14: 00007ffe9b493f98 R15: 00007febaffd2020
[ 1514.564652] </TASK>
[ 1514.564652] CPU: 0 PID: 706 Comm: ifconfig Not tainted 6.1.0-17-amd64 #1 Debian 6.1.69-1
[ 1514.564652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 1514.564652] RIP: 0010:virtnet_send_command+0x180/0x1e0 [virtio_net]
[ 1514.564652] Code: b6 a8 e4 ff 85 c0 0f 88 9a 77 00 00 48 8b 7b 08 e8 15 8d e4 ff 84 c0 75 11 eb 53 48 8b 7b 08 e8 d6 79 e4 ff 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d 74 24 04 e8 d2 87 e4 ff 48 85 c0 74 de 48 8b 83
[ 1514.564652] RSP: 0018:ffffabb68208fa20 EFLAGS: 00000246
[ 1514.564652] RAX: 0000000000000000 RBX: ffff955483bda980 RCX: 0000000000000001
[ 1514.564652] RDX: 0000000000000000 RSI: ffffabb68208fa24 RDI: ffff9554c7414e00
[ 1514.564652] RBP: ffffabb68208fa48 R08: 0000000000000004 R09: ffff955483bda980
[ 1514.564652] R10: 0000000000000003 R11: ffffabb68208fa68 R12: 0000000000000002
[ 1514.564652] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 1514.564652] FS: 00007febafdb6740(0000) GS:ffff9554fdc00000(0000) knlGS:0000000000000000
[ 1514.564652] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1514.564652] CR2: 00007febafee0f50 CR3: 0000000047f84000 CR4: 00000000000006f0
[ 1514.564652] Call Trace:
[ 1514.564652] <IRQ>
[ 1514.564652] ? rcu_dump_cpu_stacks+0xa4/0xe0
[ 1514.564652] ? rcu_sched_clock_irq.cold+0xe8/0x459
[ 1514.564652] ? update_load_avg+0x7e/0x780
[ 1514.564652] ? sched_slice+0x87/0x140
[ 1514.564652] ? timekeeping_update+0xdd/0x130
[ 1514.564652] ? timekeeping_advance+0x377/0x570
[ 1514.564652] ? update_process_times+0x70/0xb0
[ 1514.564652] ? tick_sched_handle+0x22/0x60
[ 1514.564652] ? tick_sched_timer+0x63/0x80
[ 1514.564652] ? tick_sched_do_timer+0xa0/0xa0
[ 1514.564652] ? __hrtimer_run_queues+0x112/0x2b0
[ 1514.564652] ? hrtimer_interrupt+0xf4/0x210
[ 1514.564652] ? __sysvec_apic_timer_interrupt+0x5d/0x110
[ 1514.564652] ? sysvec_apic_timer_interrupt+0x69/0x90
[ 1514.564652] </IRQ>
[ 1514.564652] <TASK>
[ 1514.564652] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1514.564652] ? virtnet_send_command+0x180/0x1e0 [virtio_net]
[ 1514.564652] virtnet_set_rx_mode+0xd4/0x2d0 [virtio_net]
[ 1514.564652] __dev_open+0x12b/0x1a0
[ 1514.564652] __dev_change_flags+0x1d2/0x240
[ 1514.564652] ? fib_inetaddr_event+0x85/0xd0
[ 1514.564652] dev_change_flags+0x22/0x60
[ 1514.564652] devinet_ioctl+0x396/0x7c0
[ 1514.564652] inet_ioctl+0x1ae/0x1e0
[ 1514.564652] sock_do_ioctl+0x7e/0x120
[ 1514.564652] sock_ioctl+0xed/0x330
[ 1514.564652] ? _copy_to_user+0x21/0x30
[ 1514.564652] ? put_user_ifreq+0x5f/0x70
[ 1514.564652] __x64_sys_ioctl+0x90/0xd0
[ 1514.564652] do_syscall_64+0x5b/0xc0
[ 1514.564652] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 1514.564652] ? syscall_exit_to_user_mode+0x27/0x40
[ 1514.564652] ? do_syscall_64+0x67/0xc0
[ 1514.564652] ? exit_to_user_mode_prepare+0x40/0x1e0
[ 1514.564652] entry_SYSCALL_64_after_hwframe+0x64/0xce
[ 1514.564652] RIP: 0033:0x7febafeb6c5b
[ 1514.564652] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1514.564652] RSP: 002b:00007ffe9b493b80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1514.564652] RAX: ffffffffffffffda RBX: 00007ffe9b493c80 RCX: 00007febafeb6c5b
[ 1514.564652] RDX: 00007ffe9b493be0 RSI: 0000000000008914 RDI: 0000000000000004
[ 1514.564652] RBP: 00007ffe9b493be0 R08: 0000000000000008 R09: 0000000000000000
[ 1514.564652] R10: 00007febafdd0358 R11: 0000000000000246 R12: 0000000000000041
[ 1514.564652] R13: 00007ffe9b493c80 R14: 00007ffe9b493f98 R15: 00007febaffd2020
[ 1514.564652] </TASK>
L1 - Virtual Hypervisor Machine
inxi
CPU: 8x 1-core Intel Xeon E5-2697 v2 (-SMP-) speed: 2700 MHz Kernel: 6.6.10-1-MANJARO x86_64
Up: 20m Mem: 2.62/15.62 GiB (16.8%) Storage: 100 GiB (52.1% used) Procs: 267 Shell: Bash inxi: 3.3.32
virt-host-validate
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : PASS
QEMU: Checking if IOMMU is enabled by kernel : PASS
QEMU: Checking for secure guest support : WARN (Unknown if this
platform has Secure Guest support)
LXC: Checking for Linux >= 2.6.26 : PASS
LXC: Checking for namespace ipc : PASS
LXC: Checking for namespace mnt : PASS
LXC: Checking for namespace pid : PASS
LXC: Checking for namespace uts : PASS
LXC: Checking for namespace net : PASS
LXC: Checking for namespace user : PASS
LXC: Checking for cgroup 'cpu' controller support : PASS
LXC: Checking for cgroup 'cpuacct' controller support : PASS
LXC: Checking for cgroup 'cpuset' controller support : PASS
LXC: Checking for cgroup 'memory' controller support : PASS
LXC: Checking for cgroup 'devices' controller support : PASS
LXC: Checking for cgroup 'freezer' controller support : PASS
LXC: Checking for cgroup 'blkio' controller support : PASS
LXC: Checking if device /sys/fs/fuse/connections exists : PASS
CH: Checking for hardware virtualization : PASS
CH: Checking if device /dev/kvm exists : PASS
CH: Checking if device /dev/kvm is accessible : PASS
Runs using libvirt with OVMF / EFI with host-passthrough and is configured with:
- vIOMMU device from the libvirt domain definition:
<iommu model='intel'>
<driver intremap='on' eim='on' iotlb='on'/>
</iommu>
- iommu via kernel parameters in the /etc/default/grub configuration file:
cat /etc/default/grub | grep LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="resume=UUID=67d043d0-7046-44d9-914e-256b2a515ba4 udev.log_priority=3 iommu=1 intel_iommu=on systemd.unified_cgroup_hierarchy=0 cgroup_enable=cpuset cgroup_enable=cpu cgroup_enable=devices cgroup_enable=freezer cgroup_enable=blkio"
- vfio via the modprobe.d/vfio.conf configuration file:
cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1af4:1041
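The GRUB file only shows what is configured, not what the running kernel was actually booted with. As a hedged sketch, the parameters listed above could be checked against /proc/cmdline with a small helper. The function name and its optional second argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word on the kernel
# command line, so that "iommu=1" does not match inside "intel_iommu=on".
has_kernel_param() {
    # $1: parameter to look for, e.g. intel_iommu=on
    # $2: file to read, defaults to /proc/cmdline (overridable for testing)
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage on the L1 host: `has_kernel_param intel_iommu=on && echo "intel_iommu=on is active"`.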
The virtual macvtap device to be passed through (08:00.0) is released from the virtio-pci driver and reclaimed by the vfio-pci driver, as expected from the module ids filter. See the last entry compared to the others with the same vendor and device ID:
lspci -nnk -d 1af4:1041
01:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
Subsystem: Red Hat, Inc. Virtio 1.0 network device [1af4:1100]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci
02:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
Subsystem: Red Hat, Inc. Virtio 1.0 network device [1af4:1100]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci
08:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network device [1af4:1041] (rev 01)
Subsystem: Red Hat, Inc. Virtio 1.0 network device [1af4:1100]
Kernel driver in use: vfio-pci
Kernel modules: virtio_pci
The PCI passthrough configuration looks as one would expect and is ready for hostdev passthrough to the L2 guest.
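The same driver binding can be cross-checked directly in sysfs, which also exposes the IOMMU group of the device. The helper below is a sketch under assumptions: the function name is hypothetical, the optional sysfs root argument exists only so the function can be exercised against a test directory, and the PCI domain 0000 for the L1 device is assumed rather than taken from the lspci output.

```shell
#!/bin/sh
# Print the bound kernel driver and the IOMMU group of a PCI device,
# both read from the symlinks that sysfs provides for each device.
pci_binding() {
    # $1: full PCI address including the domain, e.g. 0000:08:00.0
    # $2: sysfs root, defaults to /sys (overridable for testing)
    dev="${2:-/sys}/bus/pci/devices/$1"
    [ -d "$dev" ] || { echo "no such device: $1" >&2; return 1; }
    driver=none
    group=none
    # "driver" links to the bound driver directory, if any driver is bound.
    [ -L "$dev/driver" ] && driver=$(basename "$(readlink "$dev/driver")")
    # "iommu_group" links to the group directory when an IOMMU is active.
    [ -L "$dev/iommu_group" ] && group=$(basename "$(readlink "$dev/iommu_group")")
    echo "$1 driver=$driver iommu_group=$group"
}
```

Usage on the L1 host, assuming domain 0000: `pci_binding 0000:08:00.0`. Given the lspci output above, the driver field would be expected to read vfio-pci.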
L0 - Physical Bottom Host Machine
inxi
CPU: 12-core Intel Xeon E5-2697 v2 (-MT MCP-)
speed/min/max: 1308/1200/3500 MHz Kernel: 6.5.13-7-MANJARO x86_64
Up: 1d 14h 6m Mem: 45.29/125.77 GiB (36.0%) Storage: 60.95 TiB (0.1% used)
Procs: 521 Shell: Zsh inxi: 3.3.31
virt-host-validate
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : PASS
QEMU: Checking if IOMMU is enabled by kernel : PASS
QEMU: Checking for secure guest support : WARN (Unknown if this
platform has Secure Guest support)
LXC: Checking for Linux >= 2.6.26 : PASS
LXC: Checking for namespace ipc : PASS
LXC: Checking for namespace mnt : PASS
LXC: Checking for namespace pid : PASS
LXC: Checking for namespace uts : PASS
LXC: Checking for namespace net : PASS
LXC: Checking for namespace user : PASS
LXC: Checking for cgroup 'cpu' controller support : PASS
LXC: Checking for cgroup 'cpuacct' controller support : PASS
LXC: Checking for cgroup 'cpuset' controller support : PASS
LXC: Checking for cgroup 'memory' controller support : PASS
LXC: Checking for cgroup 'devices' controller support : PASS
LXC: Checking for cgroup 'freezer' controller support : PASS
LXC: Checking for cgroup 'blkio' controller support : PASS
LXC: Checking if device /sys/fs/fuse/connections exists : PASS
CH: Checking for hardware virtualization : PASS
CH: Checking if device /dev/kvm exists : PASS
CH: Checking if device /dev/kvm is accessible : PASS