1. 16 Apr, 2018 3 commits
  2. 14 Apr, 2018 13 commits
  3. 13 Apr, 2018 1 commit
    • powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features · 81b654c2
      Michael Ellerman authored
      The cpu_has_feature() mechanism has an optimisation where at build
      time we construct a mask of the CPU feature bits that will always be
      true for the given .config, based on the platform/bitness/etc. that we
      are building for.
      
      That is incompatible with DT CPU features, where the set of CPU
      features is dependent on feature flags that are given to us by
      firmware.
      
      The result is that some feature bits can not be *disabled* by DT CPU
      features. Or more accurately, they can be disabled but they will still
      appear in the ALWAYS mask, meaning cpu_has_feature() will always
      return true for them.
      
      In the past this hasn't really been a problem because on Book3S
      64 (where we support DT CPU features), the set of ALWAYS bits has been
      very small. That was because we always built for POWER4 and later,
      meaning the set of common bits was small.
      
      The only bit that could be cleared by DT CPU features that was also in
      the ALWAYS mask was CPU_FTR_NODSISRALIGN, and that was only used in
      the alignment handler to create a fake DSISR. That code was itself
      deleted in 31bfdb03 ("powerpc: Use instruction emulation
      infrastructure to handle alignment faults") (Sep 2017).
      
      However the set of ALWAYS features changed with the recent commit
      db5ae1c1 ("powerpc/64s: Refine feature sets for little endian
      builds") which restricted the set of feature flags when building
      little endian to Power7 or later. That caused the ALWAYS mask to
      become much larger for little endian builds.
      
      The result is that the following feature bits currently cannot
      be *disabled* by DT CPU features:
      
        CPU_FTR_REAL_LE, CPU_FTR_MMCRA, CPU_FTR_CTRL, CPU_FTR_SMT,
        CPU_FTR_PURR, CPU_FTR_SPURR, CPU_FTR_DSCR, CPU_FTR_PKEY,
        CPU_FTR_VMX_COPY, CPU_FTR_CFAR, CPU_FTR_HAS_PPR.
      
      To fix it we need to mask the set of ALWAYS features with the base set
      of DT CPU features, ie. the features that are always enabled by DT CPU
      features. That way there are no bits in the ALWAYS mask that are not
      also always set by DT CPU features.
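
      The effect of the ALWAYS mask can be sketched in plain C. This is a
      minimal illustration, not the kernel code: the feature bits FTR_A and
      FTR_B, the mask names and has_feature() are all hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FTR_A (1ull << 0) /* assumed to be in the DT base set */
#define FTR_B (1ull << 1) /* assumed to be something firmware may disable */

/* Features the build-time .config assumes are always present. */
#define FTRS_ALWAYS_BUILD (FTR_A | FTR_B)
/* Features that DT CPU features always enables. */
#define FTRS_DT_BASE (FTR_A)
/* The fix: keep only ALWAYS bits that are also in the DT base set. */
#define FTRS_ALWAYS_FIXED (FTRS_ALWAYS_BUILD & FTRS_DT_BASE)

static bool has_feature(uint64_t always, uint64_t runtime, uint64_t ftr)
{
    /* Build-time shortcut: bits in 'always' never consult runtime state. */
    if (always & ftr)
        return true;
    return (runtime & ftr) != 0;
}
```

      With a runtime mask of just FTR_A, has_feature() still reports FTR_B
      when given FTRS_ALWAYS_BUILD, but not when given FTRS_ALWAYS_FIXED.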
      
      Fixes: db5ae1c1 ("powerpc/64s: Refine feature sets for little endian builds")
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 12 Apr, 2018 21 commits
    • arch/sh: pcie-sh7786: handle non-zero DMA offset · bf9c7e3d
      Thomas Petazzoni authored
      On SuperH, the base of the physical memory might be different from
      zero. In this case, PCI address zero will map to a non-zero physical
      address. In order to make sure that the DMA mapping API takes care of
      this DMA offset, we must fill in the dev->dma_pfn_offset field for PCI
      devices. This gets done in the pcibios_bus_add_device() hook, called
      for each new PCI device detected.
      
      The dma_pfn_offset global variable is re-calculated for every PCI
      controller available on the platform, but that's not an issue because
      its value will each time be exactly the same, as it only depends on
      the memory start address and memory size.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: pcie-sh7786: adjust the memory mapping · 79e1c5e7
      Thomas Petazzoni authored
      The code setting up the PCI -> SuperHighway mapping doesn't take into
      account the fact that the address stored in PCIELARx must be aligned
      with the size stored in PCIELAMRx.
      
      For example, when your physical memory starts at 0x0800_0000 (128 MB),
      a size of 64 MB or 128 MB is fine. However, if you have 256 MB of
      memory, it doesn't work because the base address is not aligned on the
      size.
      
      In such a situation, we have to round down the base address to make
      sure it is aligned on the size of the area. For example, for a
      0x0800_0000 base address with 256 MB of memory, we will round down to
      0x0, and extend the size of the mapping to 512 MB.
      
      This allows the mapping to work on platforms that have 256 MB of
      RAM. The current setup would only work with 128 MB of RAM or less.
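
      The rounding can be sketched as follows. This is an illustration under
      stated assumptions, not the driver code: struct pci_window,
      roundup_pow2() and align_window() are hypothetical names, and the
      window size is assumed to be a power of two.

```c
#include <stdint.h>

/* Hypothetical description of the PCI -> SuperHighway window. */
struct pci_window {
    uint64_t base; /* value that would go into PCIELARx */
    uint64_t size; /* size that would be encoded in PCIELAMRx */
};

/* Smallest power of two that is >= v (v must be non-zero). */
static uint64_t roundup_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Find a size-aligned window that covers [mem_start, mem_start + mem_size). */
static struct pci_window align_window(uint64_t mem_start, uint64_t mem_size)
{
    struct pci_window w;
    uint64_t end = mem_start + mem_size;

    w.size = roundup_pow2(mem_size);
    w.base = mem_start & ~(w.size - 1); /* round base down to the size */

    /* Rounding down may leave the top of RAM uncovered: grow and retry. */
    while (w.base + w.size < end) {
        w.size <<= 1;
        w.base = mem_start & ~(w.size - 1);
    }
    return w;
}
```

      For 256 MB at 0x0800_0000 this yields base 0x0 and size 512 MB; for
      128 MB at the same address the base and size are left unchanged.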
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: pcie-sh7786: adjust PCI MEM and IO regions · 5da1bb96
      Thomas Petazzoni authored
      The current definition of the PCIe IO and MEM resources for SH7786
      doesn't match what the datasheet says. For example, for PCIe0
      0xfe100000 is advertised by the datasheet as a PCI IO region, while
      0xfd000000 is advertised as a PCI MEM region. The code currently
      inverts the two.
      
      The SH4A_PCIEPARL and SH4A_PCIEPTCTLR registers allow defining the
      base address and role of the different regions (including whether it's
      a MEM or IO region). However, practical experience on a SH7786 shows
      that if 0xfe100000 is used for MEM and 0xfd000000 for IO, a PCIe
      device using two MEM BARs cannot be accessed at all. Simply using
      0xfe100000 for IO and 0xfd000000 for MEM makes the PCIe device
      accessible.
      
      It is very likely that this was never seen because there are two other
      PCI MEM regions listed in the resources. However, for different
      reasons, neither of the two other MEM regions is usable on the specific
      SH7786 platform on which the problem was encountered. Therefore, the
      last MEM region at 0xfe100000 was used to place the BARs, making the
      device non-functional.
      
      This commit therefore adjusts those PCI MEM and IO resource
      definitions so that they match what the datasheet says. They have only
      been tested with PCIe 0.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: pcie-sh7786: exclude unusable PCI MEM areas · d62e9bf5
      Thomas Petazzoni authored
      Depending on the physical memory layout, some PCI MEM areas are not
      usable. According to the SH7786 datasheet, the PCI MEM area from
      1000_0000 to 13FF_FFFF is only usable if the physical memory layout
      (in MMSELR) is 1, 2, 5 or 6. In all other configurations, this PCI MEM
      area is not usable (because it overlaps with DRAM).
      
      Therefore, this commit adjusts the PCI SH7786 initialization to mark
      the relevant PCI resource as IORESOURCE_DISABLED if we can't use it.
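
      A minimal sketch of that decision, assuming the layout number has
      already been read from MMSELR. The helper names are hypothetical, and
      RES_DISABLED is assumed to mirror the value of IORESOURCE_DISABLED.

```c
#include <stdbool.h>

/* Assumed to mirror IORESOURCE_DISABLED from linux/ioport.h. */
#define RES_DISABLED 0x10000000ul

/* The 1000_0000 to 13FF_FFFF area is usable only in layouts 1, 2, 5 and 6. */
static bool pci_mem_1000_usable(unsigned int mm_sel)
{
    switch (mm_sel) {
    case 1:
    case 2:
    case 5:
    case 6:
        return true;
    default:
        return false; /* the area overlaps with DRAM */
    }
}

/* Return the resource flags, marked disabled when the area is unusable. */
static unsigned long fixup_mem_flags(unsigned long flags, unsigned int mm_sel)
{
    if (!pci_mem_1000_usable(mm_sel))
        flags |= RES_DISABLED;
    return flags;
}
```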
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: pcie-sh7786: mark unavailable PCI resource as disabled · 7dd7f698
      Thomas Petazzoni authored
      Some PCI MEM resources are marked as IORESOURCE_MEM_32BIT, which means
      they are only usable when the SH core runs in 32-bit mode. In 29-bit
      mode, such memory regions are not usable.
      
      The existing code for SH7786 properly skips such regions when
      configuring the PCIe controller registers. However, because such
      regions are still described in the resource array, the
      pcibios_scanbus() function in the SuperH pci.c will register them to
      the PCI core. Due to this, the PCI core will allocate MEM areas from
      this resource, and assign BARs pointing to this area, even though it's
      unusable.
      
      In order to prevent this from happening, we mark such regions as
      IORESOURCE_DISABLED, which tells the SuperH pci.c pcibios_scanbus()
      function to skip them.
      
      Note that we separate marking the region as disabled from skipping it,
      because other regions will be marked as disabled in follow-up patches.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: pci: don't use disabled resources · 3aeb93a0
      Thomas Petazzoni authored
      In pcibios_scanbus(), we provide to the PCI core the usable MEM and IO
      regions using pci_add_resource_offset(). We travel through all
      resources available in the "struct pci_channel".
      
      Also, in register_pci_controller(), we travel through all resources to
      request them, making sure they don't conflict with already requested
      resources.
      
      However, some resources may be disabled, in which case they should not
      be requested nor provided to the PCI core.
      
      In the current situation, none of the resources are disabled. However,
      follow-up patches in this series will make some resources disabled,
      making this preliminary change necessary.
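
      The skip itself is a one-line test in the resource loop. The sketch
      below uses a simplified, hypothetical struct res instead of the
      kernel's struct resource, and only counts the entries that would be
      passed on to the PCI core.

```c
#include <stddef.h>

/* Assumed to mirror IORESOURCE_DISABLED from linux/ioport.h. */
#define RES_DISABLED 0x10000000ul

/* Simplified stand-in for struct resource. */
struct res {
    unsigned long start;
    unsigned long end;
    unsigned long flags;
};

/* Count the resources that would be requested and given to the PCI core. */
static size_t count_usable(const struct res *r, size_t n)
{
    size_t i, usable = 0;

    for (i = 0; i < n; i++) {
        if (r[i].flags & RES_DISABLED)
            continue; /* neither request nor register this one */
        usable++;
    }
    return usable;
}
```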
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: make the DMA mapping operations observe dev->dma_pfn_offset · ce883130
      Thomas Petazzoni authored
      Some devices may have a non-zero DMA offset, i.e. an offset between the
      DMA address and the physical address. Such an offset can be encoded
      into the dma_pfn_offset field of "struct device", but the SuperH
      implementation of the DMA mapping API does not observe this
      information.
      
      This commit fixes that by ensuring the DMA address is properly
      calculated depending on this DMA offset.
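
      As a sketch of the arithmetic, assuming 4 KB pages and a DMA address
      that is lower than the physical address by the offset. The macro and
      function names are hypothetical, not the SuperH implementation.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Convert a physical address to the address the device must use. */
static uint64_t phys_to_dma_addr(uint64_t phys, unsigned long dma_pfn_offset)
{
    /* The offset is stored in page frames, so scale it to bytes first. */
    return phys - ((uint64_t)dma_pfn_offset << SKETCH_PAGE_SHIFT);
}
```

      With RAM starting at 0x0800_0000 the offset is 0x8000 page frames, so
      physical address 0x0800_1000 is seen by the device as 0x1000.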
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • arch/sh: add sh7786_mm_sel() function · bc05aa6e
      Thomas Petazzoni authored
      The SH7786 has different physical memory layout configurations,
      configurable through the MMSELR register. The configuration is
      typically defined by the bootloader, so Linux generally doesn't care.
      
      However, depending on the configuration, some PCI MEM areas may or
      may not be available. This commit adds a helper function that
      retrieves the current physical memory layout configuration. It will
      be used in a following patch to exclude unusable PCI MEM areas during
      the PCI initialization.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • sh: fix debug trap failure to process signals before return to user · 96a59899
      Rich Felker authored
      When responding to a debug trap (breakpoint) in userspace, the
      kernel's trap handler raised SIGTRAP but returned from the trap via a
      code path that ignored pending signals, resulting in an infinite loop
      re-executing the trapping instruction.
      Signed-off-by: Rich Felker <dalias@libc.org>
    • sh: fix memory corruption of unflattened device tree · eb6b6930
      Rich Felker authored
      unflatten_device_tree() makes use of memblock allocation, and
      therefore must be called before paging_init() migrates the memblock
      allocation data to the bootmem framework. Otherwise the record of the
      allocation for the expanded device tree will be lost, and will
      eventually be clobbered when allocated for another use.
      Signed-off-by: Rich Felker <dalias@libc.org>
    • sh: fix futex FUTEX_OP_SET op on userspace addresses · 9b7e30ab
      Aurelien Jarno authored
      Commit 00b73d8d ("sh: add working futex atomic ops on userspace
      addresses for smp") changed the futex_atomic_op_inuser function to
      use a loop. In the case of the FUTEX_OP_SET op with a userspace address
      containing a value other than 0, this loop never terminates.
      
      Fix that by loading the value of oldval from the userspace before doing
      the cmpxchg op, also for the FUTEX_OP_SET case.
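
      The shape of the fixed loop can be sketched in user-space C11 atomics.
      This is an analogy rather than the SuperH code: futex_op_sketch() and
      the enum are hypothetical, and atomic_compare_exchange_strong() stands
      in for the kernel's cmpxchg on a user address.

```c
#include <stdatomic.h>

enum sketch_op { SKETCH_OP_SET, SKETCH_OP_ADD };

/* Apply 'op' to *uaddr atomically and return the previous value. */
static int futex_op_sketch(_Atomic int *uaddr, enum sketch_op op, int oparg)
{
    int oldval, newval;

    do {
        /* Load the current value for every op, including SET, so the
         * compare below is made against what is really in memory. */
        oldval = atomic_load(uaddr);

        switch (op) {
        case SKETCH_OP_SET:
            newval = oparg;
            break;
        case SKETCH_OP_ADD:
            newval = oldval + oparg;
            break;
        default:
            newval = oldval;
            break;
        }
        /* Retry if another thread changed *uaddr in the meantime. */
    } while (!atomic_compare_exchange_strong(uaddr, &oldval, newval));

    return oldval;
}
```

      If the load were skipped for SET and oldval left at 0, the compare
      could never succeed against a non-zero value that nobody else changes.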
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Rich Felker <dalias@libc.org>
    • x86: Add check for APIC access address for vmentry of L2 guests · f0f4cf5b
      Krish Sadhukhan authored
      According to the sub-section titled 'VM-Execution Control Fields' in the
      section titled 'Basic VM-Entry Checks' in Intel SDM vol. 3C, the following
      vmentry check must be enforced:
      
          If the 'virtualize APIC-accesses' VM-execution control is 1, the
          APIC-access address must satisfy the following checks:
      
      	- Bits 11:0 of the address must be 0.
      	- The address should not set any bits beyond the processor's
      	  physical-address width.
      
      This patch adds the necessary check to conform to this rule. If the check
      fails, we cause the L2 VMENTRY to fail which is what the associated unit
      test (following patch) expects.
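
      The two conditions can be written as a small predicate. This is a
      sketch, not the KVM code: apic_access_addr_valid() is a hypothetical
      name, and maxphyaddr is assumed to be the physical-address width in
      bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check an APIC-access address against the two rules quoted above. */
static bool apic_access_addr_valid(uint64_t addr, unsigned int maxphyaddr)
{
    /* Bits 11:0 must be 0, i.e. the address is 4 KB aligned. */
    if (addr & 0xfffull)
        return false;

    /* No bits at or above the physical-address width may be set.
     * The width is tested first so the shift count stays below 64. */
    if (maxphyaddr < 64 && (addr >> maxphyaddr) != 0)
        return false;

    return true;
}
```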
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • powerpc/mm/radix: Fix checkstops caused by invalid tlbiel · 2675c13b
      Michael Ellerman authored
      In tlbiel_radix_set_isa300() we use the PPC_TLBIEL() macro to
      construct tlbiel instructions. The instruction takes 5 fields, two of
      which are registers, and the others are constants. But because it's
      constructed with inline asm the compiler doesn't know that.
      
      We got the constraint wrong on the 'r' field: using "r" tells the
      compiler to put the value in a register, so the value we then get in
      the macro is the *register number*, not the value of the field.
      
      That means when we mask the register number with 0x1 we get 0 or 1
      depending on which register the compiler happens to put the constant
      in, eg:
      
        li      r10,1
        tlbiel  r8,r9,2,0,0
      
        li      r7,1
        tlbiel  r10,r6,0,0,1
      
      If we're unlucky we might generate an invalid instruction form, for
      example RIC=0, PRS=1 and R=0 (tlbiel r8,r7,0,1,0). This has been
      observed to cause machine checks:
      
        Oops: Machine check, sig: 7 [#1]
        CPU: 24 PID: 0 Comm: swapper
        NIP:  00000000000385f4 LR: 000000000100ed00 CTR: 000000000000007f
        REGS: c00000000110bb40 TRAP: 0200
        MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48002222  XER: 20040000
        CFAR: 00000000000385d0 DAR: 0000000000001c00 DSISR: 00000200 SOFTE: 1
      
      If the machine check happens early in boot while we have MSR_ME=0 it
      will escalate into a checkstop and kill the box entirely.
      
      To fix it we could change the inline asm constraint to "i" which
      tells the compiler the value is a constant. But a better fix is to just
      pass a literal 1 into the macro, which bypasses any problems with inline
      asm constraints.
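
      The masking effect can be reproduced outside of inline asm. In this
      sketch FIELD_R() is a hypothetical stand-in for the part of the macro
      that encodes the R field; the 1-bit mask comes from the description
      above, while the shift of 16 is an assumption.

```c
#include <stdint.h>

/* Keep the low bit of the argument and place it in the R field.
 * The bit position (16) is assumed for illustration. */
#define FIELD_R(r) (((uint32_t)(r) & 0x1u) << 16)

/* What the macro saw with the "r" constraint: a register number. */
static uint32_t r_field_from_regnum(unsigned int regnum)
{
    return FIELD_R(regnum);
}

/* What it sees after the fix: the literal field value. */
static uint32_t r_field_from_literal(void)
{
    return FIELD_R(1);
}
```

      Register r10 gives R=0 and register r7 gives R=1, matching the two
      disassembly snippets above, while the literal always gives R=1.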
      
      Fixes: d4748276 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
    • x86/pgtable: Don't set huge PUD/PMD on non-leaf entries · e3e28812
      Joerg Roedel authored
      The pmd_set_huge() and pud_set_huge() functions are used from
      the generic ioremap() code to establish large mappings where this
      is possible.
      
      But the generic ioremap() code does not check whether the
      PMD/PUD entries are already populated with a non-leaf entry,
      so that any page-table pages these entries point to will be
      lost.
      
      Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a
      BUG_ON() in vmalloc_sync_one() when PMD entries are synced
      from swapper_pg_dir to the current page-table. This happens
      because the PMD entry from swapper_pg_dir was promoted to a
      huge-page entry while the current PGD still contains the
      non-leaf entry. Because both entries are present and point
      to a different page, the BUG_ON() triggers.
      
      This was actually triggered with pti-x32 enabled in a KVM
      virtual machine by the graphics driver.
      
      A real and better fix for that would be to improve the
      page-table handling in the generic ioremap() code. But that is
      out-of-scope for this patch-set and left for later work.
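
      The guard implied by the title can be sketched on a bare 64-bit entry.
      This is an illustration only: set_huge_sketch() is a hypothetical
      name, and the two bit values are assumed to match the x86 present and
      page-size bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_PRESENT 0x001ull /* assumed: x86 present bit */
#define SK_PAGE_PSE     0x080ull /* assumed: x86 page-size (huge) bit */

/* Try to install a huge mapping; refuse if the entry is a non-leaf. */
static bool set_huge_sketch(uint64_t *entry, uint64_t phys, uint64_t prot)
{
    /* Present but not huge: the entry points to a lower-level page
     * table, which would be lost if it were overwritten. */
    if ((*entry & SK_PAGE_PRESENT) && !(*entry & SK_PAGE_PSE))
        return false;

    *entry = phys | prot | SK_PAGE_PRESENT | SK_PAGE_PSE;
    return true;
}
```

      A caller that gets false back is expected to fall back to smaller
      mappings instead of replacing the existing page table.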
      Reported-by: David H. Gutteridge <dhgutteridge@sympatico.ca>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Leave kernel text global for !PCID · 8c06c774
      Dave Hansen authored
      Global pages are bad for hardening because they potentially let an
      exploit read the kernel image via a Meltdown-style attack which
      makes it easier to find gadgets.
      
      But, global pages are good for performance because they reduce TLB
      misses when making user/kernel transitions, especially when PCIDs
      are not available, such as on older hardware, or where a hypervisor
      has disabled them for some reason.
      
      This patch implements a basic, sane policy: If you have PCIDs, you
      only map a minimal amount of kernel text global.  If you do not have
      PCIDs, you map all kernel text global.
      
      This policy effectively makes PCIDs something that not only adds
      performance but a little bit of hardening as well.
      
      I ran a simple "lseek" microbenchmark[1] to test the benefit on
      a modern Atom microserver.  Most of the benefit comes from applying
      the series before this patch ("entry only"), but there is still a
      significant benefit from this patch.
      
        No Global Lines (baseline  ): 6077741 lseeks/sec
        88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%)
        94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%)
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205518.E3D989EB@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image · 39114b7a
      Dave Hansen authored
      Summary:
      
      In current kernels, with PTI enabled, no pages are marked Global. This
      potentially increases TLB misses.  But, the mechanism by which the Global
      bit is set and cleared is rather haphazard.  This patch makes the process
      more explicit.  In the end, it leaves us with Global entries in the page
      tables for the areas truly shared by userspace and kernel and increases
      TLB hit rates.
      
      The place this patch really shines is on systems without PCIDs.  In this
      case, we are using an lseek microbenchmark[1] to see how a reasonably
      non-trivial syscall behaves.  Higher is better:
      
        No Global pages (baseline): 6077741 lseeks/sec
        88 Global Pages (this set): 7528609 lseeks/sec (+23.9%)
      
      On a modern Skylake desktop with PCIDs, the benefits are tangible, but not
      huge for a kernel compile (lower is better):
      
        No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
        28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
                                     -1.195 seconds (-0.64%)
      
      I also re-checked everything using the lseek1 test[1]:
      
        No Global pages (baseline): 15783951 lseeks/sec
        28 Global pages (this set): 16054688 lseeks/sec
      			     +270737 lseeks/sec (+1.71%)
      
      The effect is more visible, but still modest.
      
      Details:
      
      The kernel page tables are inherited from head_64.S which rudely marks
      them as _PAGE_GLOBAL.  For PTI, we have been relying on the grace of
      $DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL.
      This patch tries to do better.
      
      First, stop filtering out "unsupported" bits from being cleared in the
      pageattr code.  It's fine to filter out *setting* these bits but it
      is insane to keep us from clearing them.
      
      Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map.
      Do not rely on pageattr to do it magically.
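
      The asymmetry between setting and clearing can be sketched as below.
      This is not the pageattr code: apply_cpa_sketch() is a hypothetical
      helper, and 'supported' stands in for __supported_pte_mask.

```c
#include <stdint.h>

#define SK_PAGE_GLOBAL 0x100ull /* assumed: x86 global bit */

/* Compute new PTE flags from old ones plus set/clear masks. */
static uint64_t apply_cpa_sketch(uint64_t old, uint64_t mask_set,
                                 uint64_t mask_clr, uint64_t supported)
{
    /* Filtering what gets *set* is fine: never add unsupported bits. */
    mask_set &= supported;

    /* Clearing is left unfiltered, so an unsupported bit that is
     * already present in the entry can still be removed. */
    return (old & ~mask_clr) | mask_set;
}
```

      With the global bit absent from 'supported', a request to clear it
      from an entry that has it still takes effect, while a request to set
      it is dropped.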
      
      After this patch, we can see that "GLB" shows up in each copy of the
      page tables, that we have the same number of global entries in each
      and that they are the *same* entries.
      
        /sys/kernel/debug/page_tables/current_kernel:11
        /sys/kernel/debug/page_tables/current_user:11
        /sys/kernel/debug/page_tables/kernel:11
      
        9caae8ad6a1fb53aca2407ec037f612d  current_kernel.GLB
        9caae8ad6a1fb53aca2407ec037f612d  current_user.GLB
        9caae8ad6a1fb53aca2407ec037f612d  kernel.GLB
      
      A quick visual audit also shows that all the entries make sense.
      0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000
      is the entry/exit text:
      
        0xfffffe0000000000-0xfffffe0000002000           8K     ro                 GLB NX pte
        0xfffffe0000002000-0xfffffe0000003000           4K     RW                 GLB NX pte
        0xfffffe0000003000-0xfffffe0000006000          12K     ro                 GLB NX pte
        0xfffffe0000006000-0xfffffe0000007000           4K     ro                 GLB x  pte
        0xfffffe0000007000-0xfffffe000000d000          24K     RW                 GLB NX pte
        0xfffffe000002d000-0xfffffe000002e000           4K     ro                 GLB NX pte
        0xfffffe000002e000-0xfffffe000002f000           4K     RW                 GLB NX pte
        0xfffffe000002f000-0xfffffe0000032000          12K     ro                 GLB NX pte
        0xfffffe0000032000-0xfffffe0000033000           4K     ro                 GLB x  pte
        0xfffffe0000033000-0xfffffe0000039000          24K     RW                 GLB NX pte
        0xffffffff81c00000-0xffffffff81e00000           2M     ro         PSE     GLB x  pmd
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205517.C80FBE05@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Enable global pages for shared areas · 0f561fce
      Dave Hansen authored
      The entry/exit text and cpu_entry_area are mapped into userspace and
      the kernel.  But, they are not _PAGE_GLOBAL.  This creates unnecessary
      TLB misses.
      
      Add the _PAGE_GLOBAL flag for these areas.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205515.2977EE7D@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init · 639d6aaf
      Dave Hansen authored
      __ro_after_init data gets stuck in the .rodata section.  That's normally
      fine because the kernel itself manages the R/W properties.
      
      But, if we run __change_page_attr() on an area which is __ro_after_init,
      the .rodata checks will trigger and force the area to be immediately
      read-only, even if it is early-ish in boot.  This caused problems when
      trying to clear the _PAGE_GLOBAL bit for these areas in the PTI code:
      it cleared _PAGE_GLOBAL like I asked, but also took it upon itself
      to clear _PAGE_RW.  The kernel then oopsed the next time it wrote to
      a __ro_after_init data structure.
      
      To fix this, add the kernel_set_to_readonly check, just like we have
      for kernel text, just a few lines below in this function.
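
      A sketch of the added condition, with the address-range test reduced
      to a boolean. The function name and parameters are hypothetical, and
      the value of the RW bit is assumed to match x86.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_RW 0x002ull /* assumed: x86 writable bit */

/* Return the protection bits that must not be set for this range. */
static uint64_t forbidden_bits_sketch(bool kernel_set_to_readonly,
                                      bool in_rodata)
{
    uint64_t forbidden = 0;

    /* Only enforce read-only .rodata once the kernel has actually
     * switched it to read-only; before that, __ro_after_init data in
     * the same section must stay writable. */
    if (kernel_set_to_readonly && in_rodata)
        forbidden |= SK_PAGE_RW;

    return forbidden;
}
```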
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Comment _PAGE_GLOBAL mystery · 430d4005
      Dave Hansen authored
      I was mystified as to where the _PAGE_GLOBAL in the kernel page tables
      for kernel text came from.  I audited all the places I could find, but
      I missed one: head_64.S.
      
      The page tables that we create in here live for a long time, and they
      also have _PAGE_GLOBAL set, regardless of whether the processor supports it
      or not.  It's harmless, and we got *lucky* that the pageattr code
      accidentally clears it when we wipe it out of __supported_pte_mask and
      then later try to mark kernel text read-only.
      
      Comment some of these properties to make it easier to find and
      understand in the future.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205513.079BB265@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Remove extra filtering in pageattr code · 1a54420a
      Dave Hansen authored
      The pageattr code has a mode where it can set or clear PTE bits in
      existing PTEs, so the page protections of the *new* PTEs come from
      one of two places:
      
        1. The set/clear masks: cpa->mask_clr / cpa->mask_set
        2. The existing PTE
      
      We filter ->mask_set/clr for supported PTE bits at entry to
      __change_page_attr() so we never need to filter them again.
      
      The only other place permissions can come from is an existing PTE,
      and those presumably already have good bits.  We do not need to filter
      them again.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1a54420a
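The reasoning in the "Remove extra filtering in pageattr code" entry above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's code: the `sk_` names, the struct layout and the bit positions used in the usage note are hypothetical stand-ins. It shows that once the set/clear masks have been filtered against the supported bits, combining them with an existing, already-valid PTE cannot introduce an unsupported bit.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pgprotval_t. */
typedef uint64_t pgprotval_t;

/* Hypothetical model of the set/clear masks carried by the request. */
struct sk_cpa {
    pgprotval_t mask_set;
    pgprotval_t mask_clr;
};

/* Filter both masks once, at entry, against the supported bits. */
void sk_filter_masks(struct sk_cpa *cpa, pgprotval_t supported)
{
    cpa->mask_set &= supported;
    cpa->mask_clr &= supported;
}

/*
 * Build the new protection bits from the two sources named in the
 * commit message: the existing PTE bits and the set/clear masks.
 * No second filtering step is applied here.
 */
pgprotval_t sk_new_prot(pgprotval_t old, const struct sk_cpa *cpa)
{
    return (old & ~cpa->mask_clr) | cpa->mask_set;
}
```

For example, if bit 8 is treated as unsupported and a caller asks to set bit 8 and clear bit 1, `sk_filter_masks()` drops the request for bit 8, and `sk_new_prot()` applied to an old value of `0x3` yields `0x1`.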
    • Dave Hansen's avatar
      x86/mm: Do not auto-massage page protections · fb43d6cb
      Dave Hansen authored
      A PTE is constructed from a physical address and a pgprotval_t.
      __PAGE_KERNEL, for instance, is a pgprot_t and must be converted
      into a pgprotval_t before it can be used to create a PTE.  This is
      done implicitly within functions like pfn_pte() by massage_pgprot().
      
      However, this makes it very challenging to set bits (and keep them
      set) if your bit is being filtered out by massage_pgprot().
      
      This moves the bit filtering out of pfn_pte() and friends.  For
      users of PAGE_KERNEL*, filtering will be done automatically inside
      those macros, but users of __PAGE_KERNEL* now need to do their own
      filtering.
      
      Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
      instead of massage_pgprot().  This way, we still *look* for
      unsupported bits and properly warn about them if we find them.  This
      might happen if an unfiltered __PAGE_KERNEL* value was passed in,
      for instance.
      
      - printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
      - boot crash fix from:            Tom Lendacky <thomas.lendacky@amd.com>
      - crash bisected by:              Mike Galbraith <efault@gmx.de>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de>
      Fixed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Bisected-by: Mike Galbraith <efault@gmx.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb43d6cb
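The difference between silently filtering protection bits and merely checking them, as described in the "Do not auto-massage page protections" entry above, can be sketched in user-space C. This is a simplified model, not the kernel implementation: every `sk_`/`SK_` name is hypothetical, the mask is a plain variable standing in for __supported_pte_mask, and the real check_pgprot() may differ in what it returns and how it warns.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pgprotval_t. */
typedef uint64_t pgprotval_t;

/* Illustrative bit positions, chosen for this sketch only. */
#define SK_PAGE_PRESENT (1ULL << 0)
#define SK_PAGE_RW      (1ULL << 1)
#define SK_PAGE_GLOBAL  (1ULL << 8)

/* Models a supported-bits mask in which the global bit is unsupported. */
pgprotval_t sk_supported_mask = ~SK_PAGE_GLOBAL;

/* "Massage" model: unsupported bits are silently dropped. */
pgprotval_t sk_massage(pgprotval_t prot)
{
    return prot & sk_supported_mask;
}

/*
 * "Check" model: report whether any unsupported bit is present, so
 * that a caller who passed an unfiltered value can be warned about it.
 */
bool sk_has_unsupported(pgprotval_t prot)
{
    return (prot & ~sk_supported_mask) != 0;
}
```

In this model, a caller using a raw value such as `SK_PAGE_PRESENT | SK_PAGE_GLOBAL` would trip `sk_has_unsupported()` unless it first passes the value through `sk_massage()`, which mirrors the commit's point that users of the unfiltered constants now do their own filtering.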
  5. 11 Apr, 2018 2 commits
    • Helge Deller's avatar
      parisc: Prevent panic at system halt · 67698287
      Helge Deller authored
      When issuing a "shutdown -h now", the reboot syscall calls kernel_halt()
      which shouldn't return, otherwise one gets this panic:
      
      reboot: System halted
      Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
      CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.16.0-32bit+ #560
      Backtrace:
       [<1018a694>] show_stack+0x18/0x28
       [<106e68a8>] dump_stack+0x80/0x10c
       [<101a4df8>] panic+0xfc/0x290
       [<101a90b8>] do_exit+0x73c/0x914
       [<101c7e38>] SyS_reboot+0x190/0x1d4
       [<1017e444>] syscall_exit+0x0/0x14
      
      Fix it by letting machine_halt() call machine_power_off() which doesn't
      return.
      Signed-off-by: Helge Deller <deller@gmx.de>
      67698287
    • Ard Biesheuvel's avatar
      arm64: assembler: add macros to conditionally yield the NEON under PREEMPT · 24534b35
      Ard Biesheuvel authored
      Add support macros to conditionally yield the NEON (and thus the CPU)
      that may be called from the assembler code.
      
      In some cases, yielding the NEON involves saving and restoring a
      non-trivial amount of context (especially in the CRC folding algorithms),
      and so the macro is split into three, and the code in between is only
      executed when the yield path is taken, allowing the context to be preserved.
      The third macro takes an optional label argument that marks the resume
      path after a yield has been performed.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Dave Martin <Dave.Martin@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      24534b35