1. 13 Jan, 2018 1 commit
    • Masami Hiramatsu's avatar
      error-injection: Support fault injection framework · 4b1a29a7
      Masami Hiramatsu authored
      Support in-kernel fault-injection framework via debugfs.
      This allows you to inject a conditional error to specified
      function using debugfs interfaces.
      Here is the result of test script described in
        # ./test_fail_function.sh
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
        btrfs-progs v4.4
        See http://btrfs.wiki.kernel.org for more information.
        Label:              (null)
        UUID:               bfa96010-12e9-4360-aed0-42eec7af5798
        Node size:          16384
        Sector size:        4096
        Filesystem size:    1001.00MiB
        Block group profiles:
          Data:             single            8.00MiB
          Metadata:         DUP              58.00MiB
          System:           DUP              12.00MiB
        SSD detected:       no
        Incompat features:  extref, skinny-metadata
        Number of devices:  1
           ID        SIZE  PATH
            1  1001.00MiB  /dev/loop2
        mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
  2. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 09 Sep, 2017 2 commits
    • Luis R. Rodriguez's avatar
      kmod: move #ifdef CONFIG_MODULES wrapper to Makefile · 0ce2c202
      Luis R. Rodriguez authored
      The entire file is now conditionally compiled only when CONFIG_MODULES is
      enabled, and this this is a bool.  Just move this conditional to the
      Makefile as its easier to read this way.
      Link: http://lkml.kernel.org/r/20170810180618.22457-5-mcgrof@kernel.orgSigned-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Luis R. Rodriguez's avatar
      kmod: split out umh code into its own file · 23558693
      Luis R. Rodriguez authored
      Patch series "kmod: few code cleanups to split out umh code"
      The usermode helper has a provenance from the old usb code which first
      required a usermode helper.  Eventually this was shoved into kmod.c and
      the kernel's modprobe calls was converted over eventually to share the
      same code.  Over time the list of usermode helpers in the kernel has grown
      -- so kmod is just but one user of the API.
      This series is a simple logical cleanup which acknowledges the code
      evolution of the usermode helper and shoves the UMH API into its own
      dedicated file.  This way users of the API can later just include umh.h
      instead of kmod.h.
      Note despite the diff state the first patch really is just a code shove,
      no functional changes are done there.  I did use git format-patch -M to
      generate the patch, but in the end the split was not enough for git to
      consider it a rename hence the large diffstat.
      I've put this through 0-day and it gives me their machine compilation
      blessings with all tests as OK.
      This patch (of 4):
      There's a slew of usermode helper users and kmod is just one of them.
      Split out the usermode helper code into its own file to keep the logic and
      focus split up.
      This change provides no functional changes.
      Link: http://lkml.kernel.org/r/20170810180618.22457-2-mcgrof@kernel.orgSigned-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 17 Aug, 2017 1 commit
    • Mathieu Desnoyers's avatar
      membarrier: Provide expedited private command · 22e4ebb9
      Mathieu Desnoyers authored
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
      from all runqueues for which current thread's mm is the same as the
      thread calling sys_membarrier. It executes faster than the non-expedited
      variant (no blocking). It also works on NOHZ_FULL configurations.
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
      * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
        the rest does indeed do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarDave Watson <davejwatson@fb.com>
  5. 12 Jul, 2017 1 commit
  6. 09 May, 2017 1 commit
    • Hari Bathini's avatar
      crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE · 692f66f2
      Hari Bathini authored
      Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
      reuse crashkernel parameter for fadump", v4.
      Traditionally, kdump is used to save vmcore in case of a crash.  Some
      architectures like powerpc can save vmcore using architecture specific
      support instead of kexec/kdump mechanism.  Such architecture specific
      support also needs to reserve memory, to be used by dump capture kernel.
      crashkernel parameter can be a reused, for memory reservation, by such
      architecture specific infrastructure.
      This patchset removes dependency with CONFIG_KEXEC for crashkernel
      parameter and vmcoreinfo related code as it can be reused without kexec
      support.  Also, crashkernel parameter is reused instead of
      fadump_reserve_mem to reserve memory for fadump.
      The first patch moves crashkernel parameter parsing and vmcoreinfo
      related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE.  The
      second patch reuses the definitions of append_elf_note() & final_note()
      functions under CONFIG_CRASH_CORE in IA64 arch code.  The third patch
      removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
      in powerpc.  The next patch reuses crashkernel parameter for reserving
      memory for fadump, instead of the fadump_reserve_mem parameter.  This
      has the advantage of using all syntaxes crashkernel parameter supports,
      for fadump as well.  The last patch updates fadump kernel documentation
      about use of crashkernel parameter.
      This patch (of 5):
      Traditionally, kdump is used to save vmcore in case of a crash.  Some
      architectures like powerpc can save vmcore using architecture specific
      support instead of kexec/kdump mechanism.  Such architecture specific
      support also needs to reserve memory, to be used by dump capture kernel.
      crashkernel parameter can be a reused, for memory reservation, by such
      architecture specific infrastructure.
      But currently, code related to vmcoreinfo and parsing of crashkernel
      parameter is built under CONFIG_KEXEC_CORE.  This patch introduces
      CONFIG_CRASH_CORE and moves the above mentioned code under this config,
      allowing code reuse without dependency on CONFIG_KEXEC.  There is no
      functional change with this patch.
      Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.comSigned-off-by: default avatarHari Bathini <hbathini@linux.vnet.ibm.com>
      Acked-by: default avatarDave Young <dyoung@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 27 Dec, 2016 1 commit
  8. 15 Dec, 2016 1 commit
  9. 14 Dec, 2016 1 commit
  10. 08 Aug, 2016 1 commit
  11. 24 May, 2016 1 commit
    • Ralf Baechle's avatar
      ELF/MIPS build fix · f43edca7
      Ralf Baechle authored
      CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
      following linker errors:
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
        binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'
      CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
      linker errors:
        arch/mips/built-in.o: In function `elf_core_dump':
        binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
        binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
        binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'
      This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
      elfcore but for these configurations elfcore will not be built.
      Fixed by making elfcore selectable by a separate config symbol which
      unlike the current mechanism can also be used from other directories
      than kernel/, then having each flavor of ELF that relies on elfcore.o,
      select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
      which fixes this issue.
      Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.orgSigned-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Reviewed-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 22 Mar, 2016 1 commit
    • Dmitry Vyukov's avatar
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov authored
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      Based on a patch by Quentin Casasnovas.
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 31 Jan, 2016 1 commit
  14. 11 Sep, 2015 1 commit
    • Mathieu Desnoyers's avatar
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers authored
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      To explain the benefit of this scheme, let's introduce two example threads:
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      Before the change, we had, for each smp_mb() pairs:
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      After the change, these pairs become:
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      1) Non-concurrent Thread A vs Thread B accesses:
      Thread A                    Thread B
      prev mem accesses
      follow mem accesses
                                  prev mem accesses
                                  follow mem accesses
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      2) Concurrent Thread A vs Thread B accesses
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      * Benchmarks
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      * User-space user of this system call: Userspace RCU library
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      Results in liburcu:
      Operations in 10s, 6 readers, 2 writers:
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      This patch adds the system call to x86 and to asm-generic.
      [1] http://urcu.so
      membarrier(2) man page:
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
             membarrier - issue memory barriers on a set of threads
             #include <linux/membarrier.h>
             int membarrier(int cmd, int flags);
             The cmd argument is one of the following:
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
             The flags argument needs to be 0. For future extensions.
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
             The pair ordering is detailed as (O: ordered, X: not ordered):
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
             ENOSYS System call is not implemented.
             EINVAL Invalid arguments.
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett's avatarJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 10 Sep, 2015 2 commits
    • Dave Young's avatar
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young authored
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Dave Young's avatar
      kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Dave Young authored
      Split kexec_file syscall related code to another file kernel/kexec_file.c
      so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
      Sharing variables and functions are moved to kernel/kexec_internal.h per
      suggestion from Vivek and Petr.
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 14 Aug, 2015 2 commits
    • Dan Williams's avatar
      arch: introduce memremap() · 92281dee
      Dan Williams authored
      Existing users of ioremap_cache() are mapping memory that is known in
      advance to not have i/o side effects.  These users are forced to cast
      away the __iomem annotation, or otherwise neglect to fix the sparse
      errors thrown when dereferencing pointers to this memory.  Provide
      memremap() as a non __iomem annotated ioremap_*() in the case when
      ioremap is otherwise a pointer to cacheable memory. Empirically,
      ioremap_<cacheable-type>() call sites are seeking memory-like semantics
      (e.g.  speculative reads, and prefetching permitted).
      memremap() is a break from the ioremap implementation pattern of adding
      a new memremap_<type>() for each mapping type and having silent
      compatibility fall backs.  Instead, the implementation defines flags
      that are passed to the central memremap() and if a mapping type is not
      supported by an arch memremap returns NULL.
      We introduce a memremap prototype as a trivial wrapper of
      ioremap_cache() and ioremap_wt().  Later, once all ioremap_cache() and
      ioremap_wt() usage has been removed from drivers we teach archs to
      implement arch_memremap() with the ability to strictly enforce the
      mapping type.
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • David Howells's avatar
      Move certificate handling to its own directory · cfc411e7
      David Howells authored
      Move certificate handling out of the kernel/ directory and into a certs/
      directory to get all the weird stuff in one place and move the generated
      signing keys into this directory.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
  17. 12 Aug, 2015 1 commit
  18. 07 Aug, 2015 4 commits
  19. 06 Aug, 2015 1 commit
    • Richard Guy Briggs's avatar
      audit: clean simple fsnotify implementation · 7f492942
      Richard Guy Briggs authored
      This is to be used to audit by executable path rules, but audit watches should
      be able to share this code eventually.
      At the moment the audit watch code is a lot more complex.  That code only
      creates one fsnotify watch per parent directory.  That 'audit_parent' in
      turn has a list of 'audit_watches' which contain the name, ino, dev of
      the specific object we care about.  This just creates one fsnotify watch
      per object we care about.  So if you watch 100 inodes in /etc this code
      will create 100 fsnotify watches on /etc.  The audit_watch code will
      instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
      individual watches chained from that fsnotify mark.
      We should be able to convert the audit_watch code to do one fsnotify
      mark per watch and simplify things/remove a whole lot of code.  After
      that conversion we should be able to convert the audit_fsnotify code to
      support that hierarchy if the optimization is necessary.
      Move the access to the entry for audit_match_signal() to the beginning of
      the audit_del_rule() function in case the entry found is the same one passed
      in.  This will enable it to be used by audit_autoremove_mark_rule(),
      kill_rules() and audit_remove_parent_watches().
      This is a heavily modified and merged version of two patches originally
      submitted by Eric Paris.
      Cc: Peter Moody <peter@hda3.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: default avatarRichard Guy Briggs <rgb@redhat.com>
      [PM: added a space after a declaration to keep ./scripts/checkpatch happy]
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
  20. 14 Jul, 2015 1 commit
    • Aleksa Sarai's avatar
      cgroup: implement the PIDs subsystem · 49b786ea
      Aleksa Sarai authored
      Adds a new single-purpose PIDs subsystem to limit the number of
      tasks that can be forked inside a cgroup. Essentially this is an
      implementation of RLIMIT_NPROC that applies to a cgroup rather than a
      process tree.
      However, it should be noted that organisational operations (adding and
      removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
      the number of tasks in the hierarchy cannot exceed the limit through
      forking. This is due to the fact that, in the unified hierarchy, attach
      cannot fail (and it is not possible for a task to overcome its PIDs
      cgroup policy limit by attaching to a child cgroup -- even if migrating
      mid-fork it must be able to fork in the parent first).
      PIDs are fundamentally a global resource, and it is possible to reach
      PID exhaustion inside a cgroup without hitting any reasonable kmemcg
      policy. Once you've hit PID exhaustion, you're only in a marginally
      better state than OOM. This subsystem allows PID exhaustion inside a
      cgroup to be prevented.
      Signed-off-by: Aleksa Sarai's avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  21. 02 Jul, 2015 1 commit
  22. 30 Apr, 2015 1 commit
  23. 15 Apr, 2015 1 commit
    • iulia's avatar
      kernel: conditionally support non-root users, groups and capabilities · 2813893f
      iulia authored
      There are a lot of embedded systems that run most or all of their
      functionality in init, running as root:root.  For these systems,
      supporting multiple users is not necessary.
      This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
      non-root users, non-root groups, and capabilities optional.  It is enabled
      under CONFIG_EXPERT menu.
      When this symbol is not defined, UID and GID are zero in any possible case
      and processes always have all capabilities.
      The following syscalls are compiled out: setuid, setregid, setgid,
      setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
      getgroups, setfsuid, setfsgid, capget, capset.
      Also, groups.c is compiled out completely.
      In kernel/capability.c, capable function was moved in order to avoid
      adding two ifdef blocks.
      This change saves about 25 KB on a defconfig build.  The most minimal
      kernels have total text sizes in the high hundreds of kB rather than
      low MB.  (The 25k goes down a bit with allnoconfig, but not that much.
      The kernel was booted in Qemu.  All the common functionalities work.
      Adding users/groups is not possible, failing with -ENOSYS.
      Bloat-o-meter output:
      add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: iulia's avatarIulia Manda <iulia.manda21@gmail.com>
      Reviewed-by: Josh Triplett's avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  24. 29 Jan, 2015 1 commit
  25. 23 Jan, 2015 1 commit
  26. 22 Dec, 2014 1 commit
  27. 11 Dec, 2014 1 commit
  28. 27 Oct, 2014 1 commit
    • Alexei Starovoitov's avatar
      bpf: split eBPF out of NET · f89b7755
      Alexei Starovoitov authored
      introduce two configs:
      - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
        depend on
      - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
      that solves several problems:
      - tracing and others that wish to use eBPF don't need to depend on NET.
        They can use BPF_SYSCALL to allow loading from userspace or select BPF
        to use it directly from kernel in NET-less configs.
      - in 3.18 programs cannot be attached to events yet, so don't force it on
      - when the rest of eBPF infra is there in 3.19+, it's still useful to
        switch it off to minimize kernel size
      bloat-o-meter on x64 shows:
      add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
      tested with many different config combinations. Hopefully didn't miss anything.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 08 Aug, 2014 1 commit
    • Vivek Goyal's avatar
      bin2c: move bin2c in scripts/basic · 8370edea
      Vivek Goyal authored
      This patch series does not do kernel signature verification yet.  I plan
      to post another patch series for that.  Now distributions are already
      signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
      those signatures.
      Primary goal of this patchset is to prepare groundwork so that kernel
      image can be signed and signatures be verified during kexec load.  This
      should help with two things.
      - It should allow kexec/kdump on secureboot enabled machines.
      - In general it can help even without secureboot. By being able to verify
        kernel image signature in kexec, it should help with avoiding module
        signing restrictions. Matthew Garret showed how to boot into a custom
        kernel, modify first kernel's memory and then jump back to old kernel and
        bypass any policy one wants to.
      This patch (of 15):
      Kexec wants to use bin2c and it wants to use it really early in the build
      process. See arch/x86/purgatory/ code in later patches.
      So move bin2c in scripts/basic so that it can be built very early and
      be usable by arch/x86/purgatory/
      Signed-off-by: Vivek Goyal's avatarVivek Goyal <vgoyal@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  30. 24 Jul, 2014 1 commit
  31. 23 Jun, 2014 1 commit
  32. 23 Feb, 2014 1 commit
  33. 14 Feb, 2014 1 commit
  34. 11 Feb, 2014 1 commit