iommu: Optimise PCI SAC address trick

Jerry Snitselaar requested to merge jsnitsel/centos-stream-9:pci-sac into main

Merge Request Required Information

JIRA: https://issues.redhat.com/browse/RHEL-11705

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Conflicts: There is a conflict in include/linux/iommu.h due to out-of-order backporting. The conflict note in the commit names the specific patch.

Tested: Tested with i40e and mlx5 on the networking team's systems using iperf3. So far I have not come anywhere near exhausting the 32-bit DMA address space the way the OCP customer is seeing: with the NIC running at roughly 30Gbps (40Gbps on the i40e), only around 20% of the range is in use. This is still a good change for anyone who does get into a state where the 32-bit DMA address space is nearly exhausted and fragmented. Instead of ping-ponging back and forth between the two ranges, the allocator just switches to the larger address space and is done with it. In effect, this is per-device dynamic enabling of iommu.forcedac.

I also tested with iommu.forcedac=1 to exercise the large DMA address space, with roughly the same results, except that all of the IOVAs are then in the large range instead of the 32-bit DMA space.

I looked at the IOVA address space usage by forcing vmcores on one of the runs and dumping the active PTEs from the I/O page tables in crash. This uses a work-in-progress pykdump script that should eventually support walking the page tables for an IOVA, searching the I/O page tables for a mapping of a specified physical address, dumping the active entries in the I/O page tables, showing IOMMU capabilities/features, and dumping the IOVA rcache depots, for Intel, AMD, and ARM.

Example output:

checking pte: 700000023e180001 @ ffff9ab3764be900 level: 0 iova: 00000000d9320000 (paddr: 000000023e180000)
...
checking pte: 700000011082a001 @ ffff9ab36dd67ff8 level: 0 iova: 00000000fffff000 (paddr: 000000011082a000)
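
For anyone reading the output above, here is a minimal sketch of how the `paddr` column relates to the raw PTE value, and how to tell whether an IOVA still sits in the 32-bit (SAC) range. It is not taken from the pykdump script. The address mask (bits 12..51) is an assumption that happens to match the two sample lines, and the helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: physical address in bits 12..51, flags elsewhere. */
#define SKETCH_PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Hypothetical helper: strip the flag bits to recover the mapped address. */
static uint64_t sketch_pte_paddr(uint64_t pte)
{
    return pte & SKETCH_PTE_ADDR_MASK;
}

/* Hypothetical helper: true if the IOVA fits in a 32-bit (SAC) address. */
static bool sketch_iova_is_32bit(uint64_t iova)
{
    return iova <= 0xffffffffULL;
}
```

Both sample IOVAs (`00000000d9320000` and `00000000fffff000`) are below 4 GiB, which is consistent with the run staying inside the 32-bit DMA space.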

Summary of Changes

Robin's commit message explains it best. There have been a couple of attempts to resolve this already, but it looks like this one is finally going to stick.

        iommu: Optimise PCI SAC address trick
    
        Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
        PCI SAC address trick") and its subsequent revert, this mechanism no
        longer serves its original purpose, but now only works around broken
        hardware/drivers in a way that is unfortunately too impactful to remove.
    
        This does not, however, prevent us from solving the performance impact
        which that workaround has on large-scale systems that don't need it.
        Once the 32-bit IOVA space fills up and a workload starts allocating and
        freeing on both sides of the boundary, the opportunistic SAC allocation
        can then end up spending significant time hunting down scattered
        fragments of free 32-bit space, or just reestablishing max32_alloc_size.
        This can easily be exacerbated by a change in allocation pattern, such
        as by changing the network MTU, which can increase pressure on the
        32-bit space by leaving a large quantity of cached IOVAs which are now
        the wrong size to be recycled, but also won't be freed since the
        non-opportunistic allocations can still be satisfied from the whole
        64-bit space without triggering the reclaim path.
    
        However, in the context of a workaround where smaller DMA addresses
        aren't simply a preference but a necessity, if we get to that point at
        all then in fact it's already the endgame. The nature of the allocator
        is currently such that the first IOVA we give to a device after the
        32-bit space runs out will be the highest possible address for that
        device, ever. If that works, then great, we know we can optimise for
        speed by always allocating from the full range. And if it doesn't, then
        the worst has already happened and any brokenness is now showing, so
        there's little point in continuing to try to hide it.
    
        To that end, implement a flag to refine the SAC business into a
        per-device policy that can automatically get itself out of the way if
        and when it stops being useful.
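
The policy described in the commit message can be sketched as a small userspace model. This is not the kernel code: `struct toy_dev`, `toy_dev_init()`, `toy_alloc()` and the top-down bump allocator are all hypothetical stand-ins, freeing is not modelled, and the DMA limit is assumed to be at least 32 bits and below 64 bits. The point is only the control flow: try the 32-bit space first while the per-device flag is set, and on the first failure clear the flag and use the full range from then on.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_LIMIT_32BIT 0xffffffffULL
#define TOY_4GIB        0x100000000ULL

/* Hypothetical per-device state for this sketch; not the kernel's structures. */
struct toy_dev {
    bool try_sac_first;     /* per-device policy flag, initially true */
    uint64_t dma_limit;     /* highest address the device can reach */
    uint64_t low_top;       /* top-down cursor inside the 32-bit space */
    uint64_t low_floor;     /* lowest usable 32-bit address */
    uint64_t high_top;      /* top-down cursor above 4 GiB */
    unsigned low_attempts;  /* how many times the 32-bit space was tried */
};

/* low_capacity is assumed to be smaller than 4 GiB, so 0 can mean "failed". */
static struct toy_dev toy_dev_init(uint64_t low_capacity, uint64_t dma_limit)
{
    struct toy_dev d = {
        .try_sac_first = true,
        .dma_limit = dma_limit,
        .low_top = TOY_4GIB,
        .low_floor = TOY_4GIB - low_capacity,
        .high_top = dma_limit + 1,
        .low_attempts = 0,
    };
    return d;
}

/* Toy top-down allocator: returns 0 when the region is exhausted. */
static uint64_t toy_bump_down(uint64_t *top, uint64_t floor, uint64_t size)
{
    if (*top < floor || *top - floor < size)
        return 0;
    *top -= size;
    return *top;
}

static uint64_t toy_alloc(struct toy_dev *d, uint64_t size)
{
    /* Devices limited to 32 bits have no choice to make. */
    if (d->dma_limit <= TOY_LIMIT_32BIT)
        return toy_bump_down(&d->low_top, d->low_floor, size);

    if (d->try_sac_first) {
        uint64_t iova;

        d->low_attempts++;
        iova = toy_bump_down(&d->low_top, d->low_floor, size);
        if (iova)
            return iova;
        /* 32-bit space ran out: stop trying it for this device. */
        d->try_sac_first = false;
    }
    return toy_bump_down(&d->high_top, TOY_4GIB, size);
}
```

In this model, a device with only two free 32-bit pages gets those two pages first, then receives the highest address under its limit, and never probes the 32-bit space again (`low_attempts` stops increasing). That mirrors the "first IOVA after the 32-bit space runs out is the highest possible address" reasoning above.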

Approved Development Ticket

All submissions to CentOS Stream must reference an approved ticket in Red Hat Jira. Please follow the CentOS Stream contribution documentation for how to file this ticket and have it approved.

Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>

Edited by Jerry Snitselaar