1. 12 Jun, 2018 1 commit
  2. 09 Jun, 2018 1 commit
  3. 08 Jun, 2018 4 commits
    • Dave Martin's avatar
      arm64: Fix syscall restarting around signal suppressed by tracer · 0fe42512
      Dave Martin authored
      Commit 17c28958 ("arm64: Abstract syscallno manipulation") abstracts
      out the pt_regs.syscallno value for a syscall cancelled by a tracer
      as NO_SYSCALL, and provides helpers to set and check for this
      condition.  However, the way this was implemented has the
      unintended side-effect of disabling part of the syscall restart
      logic.
      
      This comes about because the second in_syscall() check in
      do_signal() re-evaluates the "in a syscall" condition based on the
      updated pt_regs instead of the original pt_regs.  forget_syscall()
      is explicitly called prior to the second check in order to prevent
      restart logic in the ret_to_user path being spuriously triggered,
      which means that the second in_syscall() check always yields false.
      
      This triggers a failure in
      tools/testing/selftests/seccomp/seccomp_bpf.c, when using ptrace to
      suppress a signal that interrups a nanosleep() syscall.
      
      Misbehaviour of this type is only expected in the case where a
      tracer suppresses a signal and the target process is either being
      single-stepped or the interrupted syscall attempts to restart via
      -ERESTARTBLOCK.
      
      This patch restores the old behaviour by performing the
      in_syscall() check only once at the start of the function.
      
      Fixes: 17c28958 ("arm64: Abstract syscallno manipulation")
      Signed-off-by: default avatarDave Martin <Dave.Martin@arm.com>
      Reported-by: default avatarSumit Semwal <sumit.semwal@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      0fe42512
    • Matthew Wilcox's avatar
      mm: add pt_mm to struct page · a052f0a5
      Matthew Wilcox authored
      For pgd page table pages, x86 overloads the page->index field to store a
      pointer to the mm_struct.  Rename this to pt_mm so it's visible to other
      users.
      
      Link: http://lkml.kernel.org/r/20180518194519.3820-13-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a052f0a5
    • Matthew Wilcox's avatar
      s390: use _refcount for pgtables · 620b4e90
      Matthew Wilcox authored
      Patch series "Rearrange struct page", v6.
      
      As presented at LSFMM, this patch-set rearranges struct page to give
      more contiguous usable space to users who have allocated a struct page
      for their own purposes.  For a graphical view of before-and-after, see
      the first two tabs of
      
        https://docs.google.com/spreadsheets/d/1tvCszs_7FXrjei9_mtFiKV6nW1FLnYyvPvW-qNZhdog/edit?usp=sharing
      
      Highlights:
       - deferred_list now really exists in struct page instead of just a comment.
       - hmm_data also exists in struct page instead of being a nasty hack.
       - x86's PGD pages have a real pointer to the mm_struct.
       - VMalloc pages now have all sorts of extra information stored in them
         to help with debugging and tuning.
       - rcu_head is no longer tied to slab in case anyone else wants to
         free pages by RCU.
       - slub's counters no longer share space with _refcount.
       - slub's freelist+counters are now naturally dword aligned.
       - slub loses a parameter to a lot of functions and a sysfs file.
      
      This patch (of 17):
      
      s390 borrows the storage used for _mapcount in struct page in order to
      account whether the bottom or top half is being used for 2kB page tables.
      I want to use that for something else, so use the top byte of _refcount
      instead of the bottom byte of _mapcount.  _refcount may temporarily be
      incremented by other CPUs that see a stale pointer to this page in the
      page cache, but each CPU can only increment it by one, and there are no
      systems with 2^24 CPUs today, so they will not change the upper byte of
      _refcount.  We do have to be a little careful not to lose any of their
      writes (as they will subsequently decrement the counter).
      
      Link: http://lkml.kernel.org/r/20180518194519.3820-2-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      620b4e90
    • Laurent Dufour's avatar
      mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour authored
      Currently the PTE special supports is turned on in per architecture
      header files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h depending or not on some other per
      architecture static definition.
      
      This patch introduce a new configuration variable to manage this
      directly in the Kconfig files.  It would later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here notes for some architecture where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.comSigned-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: default avatarJerome Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarJerome Glisse <jglisse@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3010a5ea
  4. 07 Jun, 2018 3 commits
    • Willem de Bruijn's avatar
      net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan · fd3a8862
      Willem de Bruijn authored
      Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
      to communicate packet metadata to userspace.
      
      For skbuffs with vlan, the first two return the packet as it may have
      existed on the wire, inserting the VLAN tag in the user buffer.  Then
      virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
      
      Commit f09e2249 ("macvtap: restore vlan header on user read")
      added this feature to macvtap. Commit 3ce9b20f ("macvtap: Fix
      csum_start when VLAN tags are present") then fixed up csum_start.
      
      Virtio, packet and uml do not insert the vlan header in the user
      buffer.
      
      When introducing virtio_net_hdr_from_skb to deduplicate filling in
      the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
      applied uniformly, breaking csum offset for packets with vlan on
      virtio and packet.
      
      Make insertion of VLAN_HLEN optional. Convert the callers to pass it
      when needed.
      
      Fixes: e858fae2 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
      Fixes: 1276f24e ("packet: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fd3a8862
    • Jeremy Linton's avatar
      arm64: topology: Avoid checking numa mask for scheduler MC selection · e156ab71
      Jeremy Linton authored
      The numa mask subset check can often lead to system hang or crash during
      CPU hotplug and system suspend operation if NUMA is disabled. This is
      mostly observed on HMP systems where the CPU compute capacities are
      different and ends up in different scheduler domains. Since
      cpumask_of_node is returned instead core_sibling, the scheduler is
      confused with incorrect cpumasks(e.g. one CPU in two different sched
      domains at the same time) on CPU hotplug.
      
      Lets disable the NUMA siblings checks for the time being, as NUMA in
      socket machines have LLC's that will assure that the scheduler topology
      isn't "borken".
      
      The NUMA check exists to assure that if a LLC within a socket crosses
      NUMA nodes/chiplets the scheduler domains remain consistent. This code will
      likely have to be re-enabled in the near future once the NUMA mask story
      is sorted.  At the moment its not necessary because the NUMA in socket
      machines LLC's are contained within the NUMA domains.
      
      Further, as a defensive mechanism during hot-plug, lets assure that the
      LLC siblings are also masked.
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: default avatarSudeep Holla <sudeep.holla@arm.com>
      Tested-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      e156ab71
    • Mark Brown's avatar
      regulator: gpio: Revert · e536700e
      Mark Brown authored
      regulator: fixed/gpio: Revert GPIO descriptor changes due to platform breakage
      
      Commit 6059577c "regulator: fixed: Convert to use GPIO descriptor
      only" broke at least the ams-delta platform since the lookup tables
      added to the board files use the function name "enable" while the driver
      uses NULL causing the regulator to not acquire and control the enable
      GPIOs.  Revert that and a couple of other commits that are caught up
      with it to fix the issue:
      
      2b6c00c1 "ARM: pxa, regulator: fix building ezx e680"
      6059577c "regulator: fixed: Convert to use GPIO descriptor only"
      37bed97f "regulator: gpio: Get enable GPIO using GPIO descriptor"
      Reported-by: default avatarJanusz Krzysztofik <jmkrzyszt@gmail.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      e536700e
  5. 06 Jun, 2018 23 commits
    • Thomas Gleixner's avatar
      x86/apic/vector: Print APIC control bits in debugfs · a07771ac
      Thomas Gleixner authored
      Extend the debugability of the vector management by adding the state bits
      to the debugfs output.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <liu.song.a23@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Link: https://lkml.kernel.org/r/20180604162224.908136099@linutronix.de
      a07771ac
    • Thomas Gleixner's avatar
      x86/platform/uv: Use apic_ack_irq() · 839b0f1c
      Thomas Gleixner authored
      To address the EBUSY fail of interrupt affinity settings in case that the
      previous setting has not been cleaned up yet, use the new apic_ack_irq()
      function instead of the special uv_ack_apic() implementation which is
      merily a wrapper around ack_APIC_irq().
      
      Preparatory change for the real fix
      
      Fixes: dccfe314 ("x86/vector: Simplify vector move cleanup")
      Reported-by: default avatarSong Liu <liu.song.a23@gmail.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Link: https://lkml.kernel.org/r/20180604162224.721691398@linutronix.de
      839b0f1c
    • Thomas Gleixner's avatar
      x86/ioapic: Use apic_ack_irq() · 2b04e46d
      Thomas Gleixner authored
      To address the EBUSY fail of interrupt affinity settings in case that the
      previous setting has not been cleaned up yet, use the new apic_ack_irq()
      function instead of directly invoking ack_APIC_irq().
      
      Preparatory change for the real fix
      
      Fixes: dccfe314 ("x86/vector: Simplify vector move cleanup")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <liu.song.a23@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Link: https://lkml.kernel.org/r/20180604162224.639011135@linutronix.de
      2b04e46d
    • Thomas Gleixner's avatar
      x86/apic: Provide apic_ack_irq() · c0255770
      Thomas Gleixner authored
      apic_ack_edge() is explicitely for handling interrupt affinity cleanup when
      interrupt remapping is not available or disable.
      
      Remapped interrupts and also some of the platform specific special
      interrupts, e.g. UV, invoke ack_APIC_irq() directly.
      
      To address the issue of failing an affinity update with -EBUSY the delayed
      affinity mechanism can be reused, but ack_APIC_irq() does not handle
      that. Adding this to ack_APIC_irq() is not possible, because that function
      is also used for exceptions and directly handled interrupts like IPIs.
      
      Create a new function, which just contains the conditional invocation of
      irq_move_irq() and the final ack_APIC_irq().
      
      Reuse the new function in apic_ack_edge().
      
      Preparatory change for the real fix.
      
      Fixes: dccfe314 ("x86/vector: Simplify vector move cleanup")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <liu.song.a23@gmail.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de
      c0255770
    • Thomas Gleixner's avatar
      x86/apic/vector: Prevent hlist corruption and leaks · 80ae7b1a
      Thomas Gleixner authored
      Several people observed the WARN_ON() in irq_matrix_free() which triggers
      when the caller tries to free an vector which is not in the allocation
      range. Song provided the trace information which allowed to decode the root
      cause.
      
      The rework of the vector allocation mechanism failed to preserve a sanity
      check, which prevents setting a new target vector/CPU when the previous
      affinity change has not fully completed.
      
      As a result a half finished affinity change can be overwritten, which can
      cause the leak of a irq descriptor pointer on the previous target CPU and
      double enqueue of the hlist head into the cleanup lists of two or more
      CPUs. After one CPU cleaned up its vector the next CPU will invoke the
      cleanup handler with vector 0, which triggers the out of range warning in
      the matrix allocator.
      
      Prevent this by checking the apic_data of the interrupt whether the
      move_in_progress flag is false and the hlist node is not hashed. Return
      -EBUSY if not.
      
      This prevents the damage and restores the behaviour before the vector
      allocation rework, but due to other changes in that area it also widens the
      chance that user space can observe -EBUSY. In theory this should be fine,
      but actually not all user space tools handle -EBUSY correctly. Addressing
      that is not part of this fix, but will be addressed in follow up patches.
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: default avatarDmitry Safonov <0x7f454c46@gmail.com>
      Reported-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Reported-by: default avatarSong Liu <liu.song.a23@gmail.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20180604162224.303870257@linutronix.de
      80ae7b1a
    • Konrad Rzeszutek Wilk's avatar
      x86/bugs: Switch the selection of mitigation from CPU vendor to CPU features · 108fab4b
      Konrad Rzeszutek Wilk authored
      Both AMD and Intel can have SPEC_CTRL_MSR for SSBD.
      
      However AMD also has two more other ways of doing it - which
      are !SPEC_CTRL MSR ways.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: kvm@vger.kernel.org
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: andrew.cooper3@citrix.com
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/20180601145921.9500-4-konrad.wilk@oracle.com
      108fab4b
    • Konrad Rzeszutek Wilk's avatar
      x86/bugs: Add AMD's SPEC_CTRL MSR usage · 6ac2f49e
      Konrad Rzeszutek Wilk authored
      The AMD document outlining the SSBD handling
      124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
      mentions that if CPUID 8000_0008.EBX[24] is set we should be using
      the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
      for speculative store bypass disable.
      
      This in effect means we should clear the X86_FEATURE_VIRT_SSBD
      flag so that we would prefer the SPEC_CTRL MSR.
      
      See the document titled:
         124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
      
      A copy of this document is available at
         https://bugzilla.kernel.org/show_bug.cgi?id=199889Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: andrew.cooper3@citrix.com
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
      6ac2f49e
    • Konrad Rzeszutek Wilk's avatar
      x86/bugs: Add AMD's variant of SSB_NO · 24809860
      Konrad Rzeszutek Wilk authored
      The AMD document outlining the SSBD handling
      124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
      mentions that the CPUID 8000_0008.EBX[26] will mean that the
      speculative store bypass disable is no longer needed.
      
      A copy of this document is available at:
          https://bugzilla.kernel.org/show_bug.cgi?id=199889Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: andrew.cooper3@citrix.com
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/20180601145921.9500-2-konrad.wilk@oracle.com
      24809860
    • Dou Liyang's avatar
      x86/vector: Fix the args of vector_alloc tracepoint · 838d76d6
      Dou Liyang authored
      The vector_alloc tracepont reversed the reserved and ret aggs, that made
      the trace print wrong. Exchange them.
      
      Fixes: 8d1e3dca ("x86/vector: Add tracepoints for vector management")
      Signed-off-by: default avatarDou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180601065031.21872-1-douly.fnst@cn.fujitsu.com
      838d76d6
    • Dou Liyang's avatar
      x86/idt: Simplify the idt_setup_apic_and_irq_gates() · 33662812
      Dou Liyang authored
      The idt_setup_apic_and_irq_gates() sets the gates from
      FIRST_EXTERNAL_VECTOR up to FIRST_SYSTEM_VECTOR first. then secondly, from
      FIRST_SYSTEM_VECTOR to NR_VECTORS, it takes both APIC=y and APIC=n into
      account.
      
      But for APIC=n, the FIRST_SYSTEM_VECTOR is equal to NR_VECTORS, all
      vectors has been set at the first step.
      
      Simplify the second step, make it just work for APIC=y.
      Signed-off-by: default avatarDou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20180523023555.2933-1-douly.fnst@cn.fujitsu.com
      33662812
    • Varsha Rao's avatar
      x86/platform/uv: Remove extra parentheses · cce2946b
      Varsha Rao authored
      Remove extra parentheses to fix the extraneous parentheses clang
      warning.
      Suggested-by: Lukas Bulwahn's avatarLukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: default avatarVarsha Rao <rvarsha016@gmail.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Nicholas Mc Guire <der.herr@hofr.at>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180520080012.8215-1-rvarsha016@gmail.com
      cce2946b
    • Kirill A. Shutemov's avatar
      x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME · 94d49eb3
      Kirill A. Shutemov authored
      AMD SME claims one bit from physical address to indicate whether the
      page is encrypted or not. To achieve that we clear out the bit from
      __PHYSICAL_MASK.
      
      The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
      For instance for upcoming Intel Multi-Key Total Memory Encryption.
      
      Factor it out into a separate feature with own Kconfig handle.
      
      It also helps with overhead of AMD SME. It saves more than 3k in .text
      on defconfig + AMD_MEM_ENCRYPT:
      
      	add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)
      
      We would need to return to this once we have infrastructure to patch
      constants in code. That's good candidate for it.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Cc: linux-mm@kvack.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180518113028.79825-1-kirill.shutemov@linux.intel.com
      94d49eb3
    • Arnd Bergmann's avatar
      x86: Mark native_set_p4d() as __always_inline · 046c0dbe
      Arnd Bergmann authored
      When CONFIG_OPTIMIZE_INLINING is enabled, the function native_set_p4d()
      may not be fully inlined into the caller, resulting in a false-positive
      warning about an access to the __pgtable_l5_enabled variable from a
      non-__init function, despite the original caller being an __init function:
      
      WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_set_p4d() to the variable .init.data:__pgtable_l5_enabled
      WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_p4d_clear() to the variable .init.data:__pgtable_l5_enabled
      
      The function native_set_p4d() references the variable __initdata
      __pgtable_l5_enabled.  This is often because native_set_p4d lacks a
      __initdata annotation or the annotation of __pgtable_l5_enabled is wrong.
      
      Marking the native_set_p4d function and its caller native_p4d_clear()
      avoids this problem.
      
      I did not bisect the original cause, but I assume this is related to the
      recent rework that turned pgtable_l5_enabled() into an inline function,
      which in turn caused the compiler to make different inlining decisions.
      
      Fixes: ad3fe525 ("x86/mm: Unify pgtable_l5_enabled usage in early boot code")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Link: https://lkml.kernel.org/r/20180605113715.1133726-1-arnd@arndb.de
      046c0dbe
    • Boqun Feng's avatar
      powerpc: Wire up restartable sequences system call · bb862b02
      Boqun Feng authored
      Wire up the rseq system call on powerpc.
      
      This provides an ABI improving the speed of a user-space getcpu
      operation on powerpc by skipping the getcpu system call on the fast
      path, as well as improving the speed of user-space operations on per-cpu
      data compared to using load-reservation/store-conditional atomics.
      Signed-off-by: default avatarBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-11-mathieu.desnoyers@efficios.com
      bb862b02
    • Boqun Feng's avatar
      powerpc: Add syscall detection for restartable sequences · 6f37be4b
      Boqun Feng authored
      Syscalls are not allowed inside restartable sequences, so add a call to
      rseq_syscall() at the very beginning of system call exiting path for
      CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
      is a syscall issued inside restartable sequences.
      Signed-off-by: default avatarBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-10-mathieu.desnoyers@efficios.com
      6f37be4b
    • Boqun Feng's avatar
      powerpc: Add support for restartable sequences · 8a417c48
      Boqun Feng authored
      Call the rseq_handle_notify_resume() function on return to userspace if
      TIF_NOTIFY_RESUME thread flag is set.
      
      Perform fixup on the pre-signal when a signal is delivered on top of a
      restartable sequence critical section.
      Signed-off-by: default avatarBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-9-mathieu.desnoyers@efficios.com
      8a417c48
    • Mathieu Desnoyers's avatar
      x86: Wire up restartable sequence system call · 05c17ced
      Mathieu Desnoyers authored
      Wire up the rseq system call on x86 32/64.
      
      This provides an ABI improving the speed of a user-space getcpu
      operation on x86 by removing the need to perform a function call, "lsl"
      instruction, or system call on the fast path, as well as improving the
      speed of user-space operations on per-cpu data.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-8-mathieu.desnoyers@efficios.com
      05c17ced
    • Mathieu Desnoyers's avatar
      x86: Add support for restartable sequences · d6761b8f
      Mathieu Desnoyers authored
      Call the rseq_handle_notify_resume() function on return to userspace if
      TIF_NOTIFY_RESUME thread flag is set.
      
      Perform fixup on the pre-signal frame when a signal is delivered on top
      of a restartable sequence critical section.
      
      Check that system calls are not invoked from within rseq critical
      sections by invoking rseq_signal() from syscall_return_slowpath().
      With CONFIG_DEBUG_RSEQ, such behavior results in termination of the
      process with SIGSEGV.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-7-mathieu.desnoyers@efficios.com
      d6761b8f
    • Mathieu Desnoyers's avatar
      arm: Wire up restartable sequences system call · 338035ed
      Mathieu Desnoyers authored
      Wire up the rseq system call on 32-bit ARM.
      
      This provides an ABI improving the speed of a user-space getcpu
      operation on ARM by skipping the getcpu system call on the fast path, as
      well as improving the speed of user-space operations on per-cpu data
      compared to using load-linked/store-conditional.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-6-mathieu.desnoyers@efficios.com
      338035ed
    • Mathieu Desnoyers's avatar
      arm: Add syscall detection for restartable sequences · b74406f3
      Mathieu Desnoyers authored
      Syscalls are not allowed inside restartable sequences, so add a call to
      rseq_syscall() at the very beginning of system call exiting path for
      CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
      is a syscall issued inside restartable sequences.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-5-mathieu.desnoyers@efficios.com
      b74406f3
    • Mathieu Desnoyers's avatar
      arm: Add restartable sequences support · 9800b9dc
      Mathieu Desnoyers authored
      Call the rseq_handle_notify_resume() function on return to
      userspace if TIF_NOTIFY_RESUME thread flag is set.
      
      Perform fixup on the pre-signal frame when a signal is delivered on top
      of a restartable sequence critical section.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20180602124408.8430-4-mathieu.desnoyers@efficios.com
      9800b9dc
    • Mathieu Desnoyers's avatar
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
    • Nicholas Piggin's avatar
      powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap · ff5bc793
      Nicholas Piggin authored
      There is a typo in f1cb8f9b ("powerpc/64s/radix: avoid ptesync after
      set_pte and ptep_set_access_flags") config ifdef, which results in the
      necessary ptesync not being issued after vmalloc.
      
      This causes random kernel faults in module load, bpf load, anywhere
      that vmalloc mappings are used.
      
      After correcting the code, this survives a guest kernel booting
      hundreds of times where previously there would be a crash every few
      boots (I haven't noticed the crash on host, perhaps due to different
      TLB and page table walking behaviour in hardware).
      
      A memory clobber is also added to the flush, just to be sure it won't
      be reordered with the pte set or the subsequent mapping access.
      
      Fixes: f1cb8f9b ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman's avatarMichael Ellerman <mpe@ellerman.id.au>
      ff5bc793
  6. 05 Jun, 2018 8 commits