Backport huge pfnmap support for significantly faster large mmap'd PCI device BARs

JIRA: https://issues.redhat.com/browse/RHEL-73613

Mapping and unmapping PCI BARs into a VM's virtual address space is a common VFIO operation when assigning a PCIe device to a VM. Currently, the BAR mappings must be set up at native page-size granularity, e.g., 4K on x86, up to 64K on ARM64. This is done using the kernel's PFNMAP support, since these device pages are not backed by the struct pages that describe system memory. For small to medium sized BARs, the mapping cost isn't very noticeable; but with vGPUs and BARs in the gigabyte range, the mapping time can become a significant fraction of VM startup time (minutes).

Upstream resolved this issue by adding huge PFNMAP support -- superpage PFNMAPs. Mapping with 2MB, and even 1GB (on x86), pages instead of 4K pages yields a speedup of three to ten orders of magnitude in the mapping operation, which is quite visible to the user of a guest VM with assigned GPUs.

This series backports the core of the upstream work: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ along with a couple of prior upstream commits that allow the series to apply cleanly to RHEL9, a couple of bug fixes that patchreview identified, and one related commit the kernel-mm team requested.

The code can be tested by doing the following (thanks to AlexW for providing this info):

echo "func vfio_pci_mmap_huge_fault +p" > /proc/dynamic_debug/control

Then you'll see things in dmesg like:

vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

Here, order 9 corresponds to a 2MB PMD mapping on x86. BAR 0 on this device is 16MB, so 2MB mappings are the best we can get. If you have a device with a BAR of at least 1GB, you should also see:

vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 18) BAR 1 page offset 0x240000: 0x100
vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 18) BAR 1 page offset 0x280000: 0x100
vfio-pci 0000:5e:00.0: vfio_pci_mmap_huge_fault(,order = 18) BAR 1 page offset 0x2c0000: 0x100

Again, order 18 corresponds to a 1GB PUD mapping, and BAR 1 of this device is 32GB (NVIDIA A10).

You'll need to be running at least QEMU 9.2 to get reliable alignment for PUD mappings (which neither RHEL-9 nor RHEL-10 has at the moment; they ship QEMU 9.1). If you see order = 0 mappings for BARs that are at least 2MB in size, something is wrong.

Omitted-fix: 722376934b6c ("mm/memory.c: simplify pfnmap_lockdep_assert")

Signed-off-by: Donald Dutile ddutile@redhat.com

Edited by Rafael Aquini
