CR4.VMXE leaks from L1 into L2 on Intel VMX

Host environment

  • Operating system: NixOS unstable
  • OS/kernel version: Linux 6.6.48
  • Architecture: x86_64 (Intel!)
  • QEMU flavor: qemu-system-x86_64
  • QEMU version: 8.2.6
  • QEMU command line:
    qemu-img create -f qcow2 test.qcow2 1G
    qemu-system-x86_64 -m 4096 -serial stdio -kernel minikernel -cpu host,+vmx -enable-kvm -drive file=test.qcow2,format=qcow2

Emulated/Virtualized environment

  • Operating system: Custom (see below), but can also be reproduced with nested virtualization
  • OS/kernel version: N/A
  • Architecture: x86_64

Description of problem

In a nested virtualization setting, savevm can cause CR4 bits to leak from L1 into L2. This causes general-protection faults (#GP) in certain guests.

The L2 guest executes this code:

mov rax, cr4  ; Get CR4
mov rcx, rax  ; Remember the old value
btc rax, 7    ; Toggle CR4.PGE
mov cr4, rax  ; #GP! <- Shouldn't happen!
mov cr4, rcx  ; Restore old value

If the guest code is interrupted at the right time (e.g. via savevm), QEMU marks CR4 dirty while the guest executes L2 code. Due to complicated KVM semantics, this results in L1 CR4 bits (VMXE) leaking into the L2 guest, and the L2 guest dies with a #GP.

Instead of the expected CR4 value, the L2 guest reads a value with VMXE set. When it tries to write this back into CR4, this triggers the general protection fault.
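
As a sanity check on the value involved, the following Python sketch decodes the CR4 value from the error message in the reproduction steps (0x2020). It assumes that the message reports the value the L2 guest tried to write with `mov cr4, rax`, i.e. after CR4.PGE was toggled. The bit positions are the architectural ones; the helper names are made up for illustration.

```python
# CR4 bit positions (architectural; only the ones relevant here).
CR4_BITS = {
    "PSE": 4,
    "PAE": 5,
    "PGE": 7,
    "VMXE": 13,
}


def decode_cr4(value):
    """Return the names of the known CR4 bits set in value, lowest bit first."""
    return [name for name, bit in CR4_BITS.items() if value & (1 << bit)]


def toggle_pge(value):
    """Mirror `btc rax, 7`: flip CR4.PGE and leave every other bit alone."""
    return value ^ (1 << CR4_BITS["PGE"])


# Assumption: 0x2020 is the value written by `mov cr4, rax`.
written = 0x2020
# Undo the toggle to recover what `mov rax, cr4` must have returned.
read_back = toggle_pge(written)

print(hex(written), decode_cr4(written))      # 0x2020 ['PAE', 'VMXE']
print(hex(read_back), decode_cr4(read_back))  # 0x20a0 ['PAE', 'PGE', 'VMXE']
```

Under that assumption, the L2 guest read 0x20a0 from CR4, whereas without the leak it would have read 0xa0 (PAE and PGE only) and written back 0x20.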

Steps to reproduce

This is only an issue on Intel systems.

Build the minikernel

  1. Install Nix: https://nixos.org/download/
  2. git clone --branch cr4_test git@github.com:tpressure/guest-tests
  3. cd guest-tests
  4. nix-build -A tests.tinivisor.elf32
  5. cp result minikernel

For your convenience, here is a pre-built binary: minikernel.gz

Run the VM

  1. qemu-img create -f qcow2 test.qcow2 1G
  2. qemu-system-x86_64 -m 4096 -serial stdio -kernel minikernel -cpu host,+vmx -enable-kvm -drive file=test.qcow2,format=qcow2
  3. Wait for test case: test_tinivisor_nested_guest_should_never_see_vmxe_in_cr4 on the serial console. At this point the L2 guest is basically executing the assembly snippet above in a loop.
  4. Type savevm in the QEMU monitor (this does not trigger reliably, so try it 2-3 times).
  5. See the message "Invalid write to CR4: 0x2020" (CR4.VMXE is set, although it should not be).
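
Because savevm may need several attempts, the step can also be scripted. The Python sketch below is one possible way to do that, not part of the original reproducer. It assumes QEMU was started with an extra QMP socket, e.g. by appending `-qmp unix:/tmp/qmp.sock,server=on,wait=off` to the command line above. The socket path, the snapshot tag and the function names are hypothetical. It sends savevm through QMP's `human-monitor-command` passthrough.

```python
import json
import socket


def qmp_command(name, **arguments):
    """Encode one QMP command as a newline-terminated JSON line."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    return (json.dumps(message) + "\n").encode()


def read_reply(stream):
    """Skip asynchronous QMP events and return the next command reply."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        message = json.loads(line)
        if "return" in message or "error" in message:
            return message


def savevm_repeatedly(socket_path, attempts=3, tag="cr4test"):
    """Issue savevm several times via the HMP passthrough command."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        stream = sock.makefile("rwb")
        stream.readline()  # QMP greeting banner
        stream.write(qmp_command("qmp_capabilities"))
        stream.flush()
        read_reply(stream)
        for i in range(attempts):
            hmp = {"command-line": f"savevm {tag}{i}"}
            stream.write(qmp_command("human-monitor-command", **hmp))
            stream.flush()
            print(read_reply(stream))
```

Once the test case name appears on the serial console, calling `savevm_repeatedly("/tmp/qmp.sock")` takes three snapshots in a row. Whether any of them hits the race still depends on timing.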

Additional information

See also this thread, where we discussed a (flawed) approach to fixing this in KVM: https://lore.kernel.org/lkml/Zh6WlOB8CS-By3DQ@google.com/t/