CR4.VMXE leaks from L1 into L2 on Intel VMX
Host environment
- Operating system: NixOS unstable
- OS/kernel version: Linux 6.6.48
- Architecture: x86_64 (Intel!)
- QEMU flavor: qemu-system-x86_64
- QEMU version: 8.2.6
- QEMU command line:
qemu-img create -f qcow2 test.qcow2 1G
qemu-system-x86_64 -m 4096 -serial stdio -kernel minikernel -cpu host,+vmx -enable-kvm -drive file=test.qcow2,format=qcow2
Emulated/Virtualized environment
- Operating system: Custom (see below), but the issue can also be reproduced with nested virtualization
- OS/kernel version: N/A
- Architecture: x86_64
Description of problem
In a nested virtualization setting, savevm can cause CR4 bits to leak from L1 into L2. This causes general-protection faults in certain guests.
The L2 guest executes this code:
mov rax, cr4 ; Get CR4
mov rcx, rax ; Remember the old value
btc rax, 7 ; Toggle CR4.PGE
mov cr4, rax ; #GP! <- Shouldn't happen!
mov cr4, rcx ; Restore old value
If the guest code is interrupted at the right time (e.g. via savevm), QEMU marks CR4 dirty while the guest executes L2 code. Due to really complicated KVM semantics, this results in L1 CR4 bits (VMXE) leaking into the L2 guest, and L2 dies with a #GP.
Instead of the expected CR4 value, the L2 guest reads a value with VMXE set. When it tries to write this back into CR4, this triggers the general protection fault.
Steps to reproduce
This is only an issue on Intel systems.
Build the minikernel
- Install Nix: https://nixos.org/download/
git clone --branch cr4_test git@github.com:tpressure/guest-tests
cd guest-tests
nix-build -A tests.tinivisor.elf32
cp result minikernel
For your convenience, here is a pre-built binary: minikernel.gz
Run the VM
qemu-img create -f qcow2 test.qcow2 1G
qemu-system-x86_64 -m 4096 -serial stdio -kernel minikernel -cpu host,+vmx -enable-kvm -drive file=test.qcow2,format=qcow2
- Wait for "test case: test_tinivisor_nested_guest_should_never_see_vmxe_in_cr4" on the serial console. At this point the L2 guest is basically executing the assembly snippet above in a loop.
- Type savevm in the QEMU monitor (might not be 100% reliable, try it 2-3 times).
- See "Invalid write to CR4: 0x2020" (CR4.VMXE set, while it shouldn't be)
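To double-check that the reported value 0x2020 really has VMXE set, the bits can be decoded with shell arithmetic. This is only a small sketch: the helper name decode_cr4 is made up for illustration, and the bit positions (PAE = bit 5, PGE = bit 7, VMXE = bit 13) are taken from the Intel SDM's CR4 layout.

```shell
# decode_cr4 is a hypothetical helper, not part of QEMU or the test kernel.
# It prints the state of three CR4 bits for the value given as $1.
decode_cr4() {
    cr4=$(( $1 ))                 # accepts hex (0x...) or decimal input
    pae=$(( (cr4 >> 5) & 1 ))     # CR4.PAE  is bit 5
    pge=$(( (cr4 >> 7) & 1 ))     # CR4.PGE  is bit 7
    vmxe=$(( (cr4 >> 13) & 1 ))   # CR4.VMXE is bit 13
    echo "PAE=$pae PGE=$pge VMXE=$vmxe"
}

decode_cr4 0x2020   # prints: PAE=1 PGE=0 VMXE=1
```

For 0x2020 this shows PAE and VMXE set and PGE clear, which matches the note that VMXE is visible in the value the L2 guest tries to write.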
Additional information
See also this discussion of a (flawed) approach to fixing this in KVM: https://lore.kernel.org/lkml/Zh6WlOB8CS-By3DQ@google.com/t/