Commits on Source (20)
-
Paul Mackerras authored
Since commit fb1522e0 ("KVM: update to new mmu_notifier semantic v2", 2017-08-31), the MMU notifier code in KVM no longer calls the kvm_unmap_hva callback. This removes the PPC implementations of kvm_unmap_hva().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
-
Paul Mackerras authored
This improves the handling of transparent huge pages in the radix hypervisor page fault handler. Previously, if a small page was faulted in to a 2MB region of guest physical space, that meant that there was a page table pointer at the PMD level, which could never be replaced by a leaf (2MB) PMD entry. This adds code to clear the PMD, invalidate the page walk cache and free the page table page in this situation, so that the leaf PMD entry can be created. This also adds code to check whether a PMD or PTE being inserted is the same as what is already there (because of a race with another CPU that faulted on the same page); if so, we don't replace the existing entry, which avoids an unnecessary PTE or PMD invalidation and TLB flush.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
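As a rough illustration of the promotion path described above, a minimal sketch follows; promote_to_huge_pmd, pmd_is_leaf, kvmppc_radix_flush_pwc and kvmppc_radix_tlbie_page are illustrative assumptions, not necessarily the kernel's actual helpers.

    /* Sketch: replace a PMD-level page table pointer with a leaf (2MB) PMD.
     * Assumes the caller holds the partition page-table lock. */
    static void promote_to_huge_pmd(struct kvm *kvm, pmd_t *pmd, pmd_t new_leaf,
                                    unsigned long gpa)
    {
        if (pmd_present(*pmd)) {
            if (pmd_is_leaf(*pmd)) {
                /* A racing CPU may have installed the same leaf already;
                 * if so, skip the invalidation and TLB flush entirely. */
                if (pmd_val(*pmd) == pmd_val(new_leaf))
                    return;
                pmd_clear(pmd);
                kvmppc_radix_tlbie_page(kvm, gpa, PMD_SHIFT);
            } else {
                pte_t *pte_table = pte_offset_kernel(pmd, 0);

                pmd_clear(pmd);                      /* drop pte-table pointer */
                kvmppc_radix_flush_pwc(kvm, gpa);    /* invalidate page walk cache */
                pte_free_kernel(kvm->mm, pte_table); /* free orphaned pte page */
            }
        }
        set_pmd_at(kvm->mm, gpa, pmd, new_leaf);     /* install the 2MB leaf */
    }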
-
Paul Mackerras authored
When using the radix MMU, we can get hypervisor page fault interrupts with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an attempt to set the R (reference) or C (change) bit in a PTE atomically failed. Previously we would find the corresponding Linux PTE and check the permission and dirty bits there, but this is not really necessary since we only need to do what the hardware was trying to do, namely set R or C atomically. This removes the code that reads the Linux PTE and just updates the partition-scoped PTE, having first checked that it is still present, and, if the access is a write, that the PTE still has write permission. Furthermore, we now check whether any other relevant bits are set in DSISR, and if there are, we proceed with the rest of the function in order to handle whatever condition they represent, instead of returning to the guest as we did previously.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
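A minimal sketch of the new R/C handling, assuming a plain cmpxchg on the pte word; the _PAGE_* names are the real radix pte bits, while set_rc_bits and its error conventions are illustrative.

    /* Sketch: do only what the hardware attempted, i.e. atomically set R
     * (and C for a write) in the partition-scoped PTE. */
    static int set_rc_bits(pte_t *ptep, bool is_write)
    {
        unsigned long old = pte_val(READ_ONCE(*ptep));
        unsigned long new;

        if (!(old & _PAGE_PRESENT))
            return -EAGAIN;     /* PTE gone: fall through to the rest of
                                   the fault handling */
        if (is_write && !(old & _PAGE_WRITE))
            return -EFAULT;     /* write permission lost: a real fault */

        new = old | _PAGE_ACCESSED | (is_write ? _PAGE_DIRTY : 0);
        if (cmpxchg((unsigned long *)ptep, old, new) != old)
            return -EAGAIN;     /* lost a race: retry */
        return 0;
    }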
-
Paul Mackerras authored
This adds code to the radix hypervisor page fault handler to handle the case where the guest memory is backed by 1GB hugepages, putting them into the partition-scoped radix tree at the PUD level. The code is essentially analogous to the code for 2MB pages. This also rearranges kvmppc_create_pte() to make it easier to follow.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
-
Paul Mackerras authored
This changes the hypervisor page fault handler for radix guests to use the generic KVM __gfn_to_pfn_memslot() function instead of using get_user_pages_fast() and then handling the case of VM_PFNMAP vmas specially. The old code missed the case of VM_IO vmas; with this change, VM_IO vmas will now be handled correctly by code within __gfn_to_pfn_memslot. Currently, __gfn_to_pfn_memslot calls hva_to_pfn, which only uses __get_user_pages_fast for the initial lookup in the cases where either atomic or async is set. Since we are not setting either atomic or async, we do our own __get_user_pages_fast first, for now. This also adds code to check for the KVM_MEM_READONLY flag on the memslot. If it is set and this is a write access, we synthesize a data storage interrupt for the guest. In the case where the page is not normal RAM (i.e. page == NULL in kvmppc_book3s_radix_page_fault()), we read the PTE from the Linux page tables because we need the mapping attribute bits as well as the PFN. (The mapping attribute bits indicate whether accesses have to be non-cacheable and/or guarded.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
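A short sketch of the KVM_MEM_READONLY handling; kvmppc_core_queue_data_storage is the existing Book3S helper for queueing a data storage interrupt, though its exact argument list here is an assumption, and ea/writing are illustrative locals.

    /* Sketch: a write to a read-only memslot becomes a data storage
     * interrupt (protection fault) for the guest. */
    if (writing && (memslot->flags & KVM_MEM_READONLY)) {
        kvmppc_core_queue_data_storage(vcpu, ea, DSISR_PROTFAULT);
        return RESUME_GUEST;
    }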
-
npiggin authored
The standard eieio; tlbsync; ptesync must follow tlbie to ensure it is ordered with respect to subsequent operations.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
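The sequence is a fixed instruction pattern; as a concrete illustration, it can be expressed as a single inline-asm statement (powerpc64):

    /* Sketch: ordering sequence to be issued after tlbie. */
    static inline void tlbie_fixup_order(void)
    {
        asm volatile("eieio; tlbsync; ptesync" : : : "memory");
    }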
-
npiggin authored
The current partition table unmap code clears the _PAGE_PRESENT bit out of the pte, which leaves pud_huge/pmd_huge true and does not clear pud_present/pmd_present. This can confuse subsequent page faults and possibly lead to the guest looping doing continual hypervisor page faults.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
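The fix amounts to clearing the whole entry rather than one bit. A sketch, using the in-tree kvmppc_radix_update_pte() updater (its use here, and the surrounding locals, are inferred from the description):

    /* Sketch: clear every bit of the pte, not just _PAGE_PRESENT, so that
     * pmd_present()/pud_present() and pmd_huge()/pud_huge() all see the
     * entry as gone. Locals are as in the unmap path. */
    old = kvmppc_radix_update_pte(kvm, ptep, ~0UL /* clr */, 0 /* set */,
                                  gpa, shift);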
-
Paul Mackerras authored
A radix guest can execute tlbie instructions to invalidate TLB entries. After a tlbie or a group of tlbies, it must then do the architected sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation has been processed by all CPUs in the system before it can rely on no CPU using any translation that it just invalidated. In fact it is the ptesync which does the actual synchronization in this sequence, and hardware has a requirement that the ptesync must be executed on the same CPU thread as the tlbies which it is expected to order. Thus, if a vCPU gets moved from one physical CPU to another after it has done some tlbies but before it can get to do the ptesync, the ptesync will not have the desired effect when it is executed on the second physical CPU. To fix this, we do a ptesync in the exit path for radix guests. If there are any pending tlbies, this will wait for them to complete. If there aren't, then ptesync will just do the same as sync.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
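Since ptesync waits for this thread's outstanding tlbies, the fix amounts to one instruction on the exit path; a sketch:

    /* Sketch: unconditional ptesync on radix guest exit, so any tlbies the
     * guest issued on this physical CPU complete before the vCPU can be
     * migrated to another one. */
    asm volatile("ptesync" : : : "memory");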
-
Aneesh Kumar K.V authored
In the next set of patches, we will switch the pmd allocator to use page fragments, and the locking will be updated to a split pmd ptlock. We want to avoid using fragments for the partition-scoped table, so use a slab cache, similar to the level 4 table.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
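A minimal sketch of what a dedicated cache for PMD-level partition-scoped tables could look like; the cache name and the use of RADIX_PMD_INDEX_SIZE for sizing are assumptions based on the description.

    /* Sketch: slab cache sized for one radix PMD-level table. */
    static struct kmem_cache *kvm_pmd_cache;

    static int kvm_pmd_cache_init(void)
    {
        size_t size = sizeof(pmd_t) << RADIX_PMD_INDEX_SIZE;

        kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, NULL);
        return kvm_pmd_cache ? 0 : -ENOMEM;
    }

    /* Allocation then becomes kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL)
     * instead of a page-fragment allocation. */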
-
Aneesh Kumar K.V authored
These functions are not used in the code. Remove them.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
-
npiggin authored
Implement a local TLB flush for invalidating an LPID, with variants for process or partition scope, and a global TLB flush for invalidating a partition-scoped page of an LPID. These will be used by KVM in subsequent patches.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
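The shape of the new entry points might look as follows; the names mirror the existing radix__ flush naming convention, but these exact prototypes are assumptions.

    /* Sketch: LPID flush variants described above. */
    void radix__local_flush_tlb_lpid(unsigned int lpid);        /* partition scope */
    void radix__local_flush_tlb_lpid_guest(unsigned int lpid);  /* process scope */
    void radix__flush_tlb_lpid_page(unsigned int lpid,
                                    unsigned long addr,
                                    unsigned long page_size);   /* global, one page */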
-
npiggin authored
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
-
npiggin authored
When partition scope mappings are unmapped with kvm_unmap_radix, the pte is cleared, but the page table structure is left in place. If the next page fault requests a different page table geometry (e.g., due to THP promotion or split), kvmppc_create_pte is responsible for changing the page tables. When a page table entry is to be converted to a large pte, the page table entry is cleared, the PWC flushed, then the page table it points to freed. This will cause pte page tables to leak when a 1GB page is to replace a pud entry that points to a pmd table with pte tables under it: the pmd table will be freed, but its pte tables will be missed. Fix this by replacing the simple clear-and-free code with code that walks down the page tables and frees children. Care must be taken to clear the root entry being unmapped and flush the PWC before freeing any page tables, as explained in comments. This requires the PWC flush to logically become a flush-all-PWC (which it already is in hardware, but the KVM API needs to be changed to avoid confusion). This code also checks that no unexpected pte entries exist in any page table being freed, and unmaps those and emits a WARN. This is an expensive operation for the pte page level, but partition scope changes are rare, so it's unconditional for now to iron out bugs. It can be put under a CONFIG option or removed after some time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
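A sketch of the recursive free for the 1GB case; it assumes the caller has already cleared the pud entry and flushed the PWC, and pmd_is_leaf/unmap_stray_ptes are illustrative helpers.

    /* Sketch: free a detached pmd table and every pte table under it. */
    static void free_pmd_table(struct kvm *kvm, pmd_t *pmd_table)
    {
        int i;

        for (i = 0; i < PTRS_PER_PMD; i++) {
            pte_t *pte_table;
            pmd_t *pmd = pmd_table + i;

            if (!pmd_present(*pmd) || pmd_is_leaf(*pmd))
                continue;
            pte_table = pte_offset_kernel(pmd, 0);
            /* No valid pte should remain: unmap any stragglers and warn. */
            WARN_ON_ONCE(unmap_stray_ptes(kvm, pte_table) != 0);
            pte_free_kernel(kvm->mm, pte_table);
        }
        pmd_free(kvm->mm, pmd_table);
    }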
-
npiggin authored
This has the advantage of consolidating TLB flush code in fewer places, and it also implements powerpc:tlbie trace events.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
-
npiggin authored
The radix guest code has fewer restrictions about what context it can run in, so move this flushing out of assembly and have it use the Linux TLB flush implementations introduced previously. This allows powerpc:tlbie trace events to be used. This changes the tlbiel sequence to only execute the RIC=2 flush once, on the first set flushed, then RIC=0 for the rest of the sets. The end result of the flush should be unchanged. This matches the local PID flush pattern that was introduced in a5998fcb ("powerpc/mm/radix: Optimise tlbiel flush all case").
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
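A sketch of that tlbiel pattern; tlbiel_radix_set is an illustrative wrapper around one tlbiel instruction, with RIC=2 meaning "flush TLB and page walk cache" and RIC=0 meaning "flush TLB entry only".

    /* Sketch: flush all congruence-class sets, invalidating the PWC only
     * once (on set 0) rather than on every iteration. */
    static void tlbiel_flush_all(unsigned int num_sets)
    {
        unsigned int set;

        asm volatile("ptesync" : : : "memory");
        for (set = 0; set < num_sets; set++)
            tlbiel_radix_set(set, set == 0 ? 2 : 0 /* RIC */);
        asm volatile("ptesync" : : : "memory");
    }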
-
npiggin authored
When the radix fault handler has no page from the process address space (e.g., for IO memory), it looks up the process pte and sets the partition table pte using that, to get attributes like CI and guarded. If the process table entry is writable, set _PAGE_DIRTY as well, to avoid an RC update. If not, then ensure _PAGE_DIRTY does not come across. Set _PAGE_ACCESSED as well, to avoid an RC update.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
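A sketch of the attribute derivation; host_pte and the surrounding locals are illustrative, while the _PAGE_* bits are the real radix pte flags.

    /* Sketch: build the partition-scope pte for a non-RAM mapping from the
     * process pte, so CI/guarded carry across, pre-setting R and C so the
     * hardware never needs an RC update. */
    unsigned long v = pte_val(host_pte);

    v |= _PAGE_ACCESSED;            /* pre-set R */
    if (v & _PAGE_WRITE)
        v |= _PAGE_DIRTY;           /* writable: pre-set C too */
    else
        v &= ~_PAGE_DIRTY;          /* read-only: dirty must not come across */
    pte = __pte(v);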
-
npiggin authored
Adding the write bit and RC bits to pte permissions does not require a pte clear and flush. There should not be other bits changed here, because restricting access or changing the PFN must have already invalidated any existing ptes (otherwise the race is already lost).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
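A sketch of a strictly additive update; kvmppc_radix_update_pte() is the in-tree pte updater, and using it with an empty clear mask in this way is an assumption drawn from the description.

    /* Sketch: OR in the write bit and R/C without clearing the pte first.
     * No TLB flush is needed because no rights are removed and the PFN is
     * unchanged; any pte with different rights/PFN was already invalidated. */
    unsigned long set_bits = _PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED;

    kvmppc_radix_update_pte(kvm, ptep, 0 /* clr */, set_bits, gpa, shift);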
-
Paul Mackerras authored
Since commit e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the number of PAGE_SIZEd pages being unmapped and passes it to kvmppc_update_dirty_map(), which expects to be passed the page size instead. Consequently it will only mark one system page dirty even when a large page (for example a THP page) is being unmapped. The consequence of this is that part of the THP page might not get copied during live migration, resulting in memory corruption for the guest. This fixes it by computing and passing the page size in kvm_unmap_radix().
Cc: stable@vger.kernel.org # v4.15+
Fixes: e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
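The fix itself is small; a sketch, where shift is the illustrative page-size shift of the mapping being unmapped:

    /* Sketch: pass the page size, not a page count, to the dirty map. */
    unsigned long psize = 1UL << shift;

    kvmppc_update_dirty_map(memslot, gfn, psize);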
-
npiggin authored
THP paths can defer splitting compound pages until after the actual remap and TLB flushes to split a huge PMD/PUD. This causes radix partition scope page table mappings to get out of sync with the host qemu page table mappings. This results in random memory corruption in the guest when running with THP. The easiest way to reproduce is to use the KVM balloon to free up a lot of memory in the guest and then shrink the balloon to give the memory back, while some work is being done in the guest.
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
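A sketch of the approach (the commit the next entry cites as 71d29f43): size the partition-scope mapping from the Linux pte itself rather than from compound_order(). __find_linux_pte() is the real walker; the surrounding locals are illustrative.

    /* Sketch: derive the host mapping size from the pte's level (shift)
     * so a deferred THP split cannot leave the guest mapping too large. */
    unsigned int shift = 0;
    pte_t *ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);

    if (!shift)
        shift = PAGE_SHIFT;        /* ordinary small page */
    /* ... create the partition-scope mapping with size 1UL << shift ... */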
-
Paul Mackerras authored
Commit 71d29f43 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size", 2018-09-11) added a call to __find_linux_pte() and a dereference of the returned PTE pointer to the radix page fault path in the common case where the page is normal system memory. Previously, __find_linux_pte() was only called for mappings to physical addresses which don't have a page struct (e.g. memory-mapped I/O) or where the page struct is marked as reserved memory. This exposes us to the possibility that the returned PTE pointer could be NULL, for example in the case of a concurrent THP collapse operation. Dereferencing the returned NULL pointer causes a host crash. To fix this, we check for NULL, and if it is NULL, we retry the operation by returning to the guest, with the expectation that it will generate the same page fault again (unless of course it has been fixed up by another CPU in the meantime).
Fixes: 71d29f43 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
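The fix is a straightforward NULL check; a sketch, with locals as in the previous entry's illustration:

    /* Sketch: a concurrent THP collapse can leave no pte to look at;
     * back out and let the guest take the fault again. */
    pte_t *ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);

    if (!ptep)
        return RESUME_GUEST;       /* retry the access from the guest */
    pte = READ_ONCE(*ptep);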