1. 05 Apr, 2019 13 commits
    • Richard Guy Briggs's avatar
      audit: hand taken context to audit_kill_trees for syscall logging · e78d5e16
      Richard Guy Briggs authored
      [ Upstream commit 9e36a5d4 ]
      
      Since the context is derived from the task parameter handed to
      __audit_free(), hand the context to audit_kill_trees() so it can be used
      to associate with a syscall record.  This requires adding the context
      parameter to kill_rules() rather than using the current audit_context.
      
      The callers of trim_marked() and evict_chunk() still have their context.
      
      The EOE record was being issued prior to the pruning of the killed_tree
      list.
      
      Move the kill_trees call before the audit_log_exit call in
      __audit_free() and __audit_syscall_exit() so that any pruned trees
      CONFIG_CHANGE records are included with the associated syscall event by
      the user library due to the EOE record flagging the end of the event.
      
      See: https://github.com/linux-audit/audit-kernel/issues/50
      See: https://github.com/linux-audit/audit-kernel/issues/59Signed-off-by: 's avatarRichard Guy Briggs <rgb@redhat.com>
      [PM: fixed merge fuzz in kernel/audit_tree.c]
      Signed-off-by: 's avatarPaul Moore <paul@paul-moore.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      e78d5e16
    • Valentin Schneider's avatar
      cpu/hotplug: Mute hotplug lockdep during init · 964065d2
      Valentin Schneider authored
      [ Upstream commit ce48c457 ]
      
      Since we've had:
      
        commit cb538267 ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      we've been getting some lockdep warnings during init, such as on HiKey960:
      
      [    0.820495] WARNING: CPU: 4 PID: 0 at kernel/cpu.c:316 lockdep_assert_cpus_held+0x3c/0x48
      [    0.820498] Modules linked in:
      [    0.820509] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S                4.20.0-rc5-00051-g4cae42a #34
      [    0.820511] Hardware name: HiKey960 (DT)
      [    0.820516] pstate: 600001c5 (nZCv dAIF -PAN -UAO)
      [    0.820520] pc : lockdep_assert_cpus_held+0x3c/0x48
      [    0.820523] lr : lockdep_assert_cpus_held+0x38/0x48
      [    0.820526] sp : ffff00000a9cbe50
      [    0.820528] x29: ffff00000a9cbe50 x28: 0000000000000000
      [    0.820533] x27: 00008000b69e5000 x26: ffff8000bff4cfe0
      [    0.820537] x25: ffff000008ba69e0 x24: 0000000000000001
      [    0.820541] x23: ffff000008fce000 x22: ffff000008ba70c8
      [    0.820545] x21: 0000000000000001 x20: 0000000000000003
      [    0.820548] x19: ffff00000a35d628 x18: ffffffffffffffff
      [    0.820552] x17: 0000000000000000 x16: 0000000000000000
      [    0.820556] x15: ffff00000958f848 x14: 455f3052464d4d34
      [    0.820559] x13: 00000000769dde98 x12: ffff8000bf3f65a8
      [    0.820564] x11: 0000000000000000 x10: ffff00000958f848
      [    0.820567] x9 : ffff000009592000 x8 : ffff00000958f848
      [    0.820571] x7 : ffff00000818ffa0 x6 : 0000000000000000
      [    0.820574] x5 : 0000000000000000 x4 : 0000000000000001
      [    0.820578] x3 : 0000000000000000 x2 : 0000000000000001
      [    0.820582] x1 : 00000000ffffffff x0 : 0000000000000000
      [    0.820587] Call trace:
      [    0.820591]  lockdep_assert_cpus_held+0x3c/0x48
      [    0.820598]  static_key_enable_cpuslocked+0x28/0xd0
      [    0.820606]  arch_timer_check_ool_workaround+0xe8/0x228
      [    0.820610]  arch_timer_starting_cpu+0xe4/0x2d8
      [    0.820615]  cpuhp_invoke_callback+0xe8/0xd08
      [    0.820619]  notify_cpu_starting+0x80/0xb8
      [    0.820625]  secondary_start_kernel+0x118/0x1d0
      
      We've also had a similar warning in sched_init_smp() for every
      asymmetric system that would enable the sched_asym_cpucapacity static
      key, although that was singled out in:
      
        commit 40fa3780 ("sched/core: Take the hotplug lock in sched_init_smp()")
      
      Those warnings are actually harmless, since we cannot have hotplug
      operations at the time they appear. Instead of starting to sprinkle
      useless hotplug lock operations in the init codepaths, mute the
      warnings until they start warning about real problems.
      Suggested-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: 's avatarValentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: cai@gmx.us
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: longman@redhat.com
      Cc: marc.zyngier@arm.com
      Cc: mark.rutland@arm.com
      Link: https://lkml.kernel.org/r/1545243796-23224-2-git-send-email-valentin.schneider@arm.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      964065d2
    • Oleg Nesterov's avatar
      cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · b8498a26
      Oleg Nesterov authored
      [ Upstream commit 51bee5ab ]
      
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: 's avatarHerton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: 's avatarJan Stancek <jstancek@redhat.com>
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      b8498a26
    • Andrea Parri's avatar
      sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · 5ca05ecd
      Andrea Parri authored
      [ Upstream commit c546951d ]
      
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
      Signed-off-by: 's avatarAndrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      5ca05ecd
    • Hidetoshi Seto's avatar
      sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK · dd288d32
      Hidetoshi Seto authored
      [ Upstream commit 1ca4fa3a ]
      
      register_sched_domain_sysctl() copies the cpu_possible_mask into
      sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
      allocated (ie, CONFIG_CPUMASK_OFFSTACK is set).  However, when
      CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
      uninitialized (all zeroes) and the kernel may fail to initialize
      sched_domain sysctl entries for all possible CPUs.
      
      This is visible to the user if the kernel is booted with maxcpus=n, or
      if ACPI tables have been modified to leave CPUs offline, and then
      checking for missing /proc/sys/kernel/sched_domain/cpu* entries.
      
      Fix this by separating the allocation and initialization, and adding a
      flag to initialize the possible CPU entries while system booting only.
      Tested-by: 's avatarSyuuichirou Ishii <ishii.shuuichir@jp.fujitsu.com>
      Tested-by: 's avatarTarumizu, Kohei <tarumizu.kohei@jp.fujitsu.com>
      Signed-off-by: 's avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: 's avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Joe Lawrence's avatarJoe Lawrence <joe.lawrence@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      dd288d32
    • Mathieu Poirier's avatar
      perf/aux: Make perf_event accessible to setup_aux() · 4e4fba6d
      Mathieu Poirier authored
      [ Upstream commit 84001866 ]
      
      When pmu::setup_aux() is called the coresight PMU needs to know which
      sink to use for the session by looking up the information in the
      event's attr::config2 field.
      
      As such simply replace the cpu information by the complete perf_event
      structure and change all affected customers.
      Signed-off-by: 's avatarMathieu Poirier <mathieu.poirier@linaro.org>
      Reviewed-by: 's avatarSuzuki Poulouse <suzuki.poulose@arm.com>
      Acked-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.orgSigned-off-by: 's avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4e4fba6d
    • Thomas Gleixner's avatar
      genirq: Avoid summation loops for /proc/stat · c0ed0486
      Thomas Gleixner authored
      [ Upstream commit 1136b072 ]
      
      Waiman reported that on large systems with a large amount of interrupts the
      readout of /proc/stat takes a long time to sum up the interrupt
      statistics. In principle this is not a problem. but for unknown reasons
      some enterprise quality software reads /proc/stat with a high frequency.
      
      The reason for this is that interrupt statistics are accounted per cpu. So
      the /proc/stat logic has to sum up the interrupt stats for each interrupt.
      
      This can be largely avoided for interrupts which are not marked as
      'PER_CPU' interrupts by simply adding a per interrupt summation counter
      which is incremented along with the per interrupt per cpu counter.
      
      The PER_CPU interrupts need to avoid that and use only per cpu accounting
      because they share the interrupt number and the interrupt descriptor and
      concurrent updates would conflict or require unwanted synchronization.
      Reported-by: Waiman Long's avatarWaiman Long <longman@redhat.com>
      Signed-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Waiman Long's avatarWaiman Long <longman@redhat.com>
      Reviewed-by: 's avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: 's avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de
      
      8<-------------
      
      v2: Undo the unintentional layout change of struct irq_desc.
      
       include/linux/irqdesc.h |    1 +
       kernel/irq/chip.c       |   12 ++++++++++--
       kernel/irq/internals.h  |    8 +++++++-
       kernel/irq/irqdesc.c    |    7 ++++++-
       4 files changed, 24 insertions(+), 4 deletions(-)
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      c0ed0486
    • Luc Van Oostenryck's avatar
      sched/topology: Fix percpu data types in struct sd_data & struct s_data · 83a6f919
      Luc Van Oostenryck authored
      [ Upstream commit 99687cdb ]
      
      The percpu members of struct sd_data and s_data are declared as:
      
      	struct ... ** __percpu member;
      
      So their type is:
      
      	__percpu pointer to pointer to struct ...
      
      But looking at how they're used, their type should be:
      
      	pointer to __percpu pointer to struct ...
      
      and they should thus be declared as:
      
      	struct ... * __percpu *member;
      
      So fix the placement of '__percpu' in the definition of these
      structures.
      
      This addresses a bunch of Sparse's warnings like:
      
      	warning: incorrect type in initializer (different address spaces)
      	  expected void const [noderef] <asn:3> *__vpp_verify
      	  got struct sched_domain **
      Signed-off-by: Luc Van Oostenryck's avatarLuc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Signed-off-by: 's avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.comSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      83a6f919
    • Masami Hiramatsu's avatar
      kprobes: Prohibit probing on RCU debug routine · c4ea4a79
      Masami Hiramatsu authored
      [ Upstream commit a39f15b9 ]
      
      Since kprobe itself depends on RCU, probing on RCU debug
      routine can cause recursive breakpoint bugs.
      
      Prohibit probing on RCU debug routines.
      
      int3
       ->do_int3()
         ->ist_enter()
           ->RCU_LOCKDEP_WARN()
             ->debug_lockdep_rcu_enabled() -> int3
      Signed-off-by: 's avatarMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/154998807741.31052.11229157537816341591.stgit@devboxSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      c4ea4a79
    • Tejun Heo's avatar
      cgroup, rstat: Don't flush subtree root unless necessary · 55d7152d
      Tejun Heo authored
      [ Upstream commit b4ff1b44 ]
      
      cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
      on flush.  While it was only visiting updated ones in the subtree, it
      was visiting @root unconditionally.  We can easily check whether @root
      is updated or not by looking at its ->updated_next just as with the
      cgroups in the subtree.
      
      * Remove the unnecessary cgroup_parent() test.  The system root cgroup
        is never updated and thus its ->updated_next is always NULL.  No
        need to test whether cgroup_parent() exists in addition to
        ->updated_next.
      
      * Terminate traverse if ->updated_next is NULL.  This can only happen
        for subtree @root and there's no reason to visit it if it's not
        marked updated.
      
      This reduces cpu consumption when reading a lot of rstat backed files.
      In a micro benchmark reading stat from ~1600 cgroups, the sys time was
      lowered by >40%.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      55d7152d
    • Dave Hansen's avatar
      mm/resource: Return real error codes from walk failures · 88f0ced0
      Dave Hansen authored
      [ Upstream commit 5cd401ac ]
      
      walk_system_ram_range() can return an error code either becuase
      *it* failed, or because the 'func' that it calls returned an
      error.  The memory hotplug does the following:
      
      	ret = walk_system_ram_range(..., func);
              if (ret)
      		return ret;
      
      and 'ret' makes it out to userspace, eventually.  The problem
      s, walk_system_ram_range() failues that result from *it* failing
      (as opposed to 'func') return -1.  That leads to a very odd
      -EPERM (-1) return code out to userspace.
      
      Make walk_system_ram_range() return -EINVAL for internal
      failures to keep userspace less confused.
      
      This return code is compatible with all the callers that I
      audited.
      Signed-off-by: 's avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: 's avatarBjorn Helgaas <bhelgaas@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: 's avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      88f0ced0
    • Christian Brauner's avatar
      sysctl: handle overflow for file-max · 4e4d4979
      Christian Brauner authored
      [ Upstream commit 32a5ad9c ]
      
      Currently, when writing
      
        echo 18446744073709551616 > /proc/sys/fs/file-max
      
      /proc/sys/fs/file-max will overflow and be set to 0.  That quickly
      crashes the system.
      
      This commit sets the max and min value for file-max.  The max value is
      set to long int.  Any higher value cannot currently be used as the
      percpu counters are long ints and not unsigned integers.
      
      Note that the file-max value is ultimately parsed via
      __do_proc_doulongvec_minmax().  This function does not report error when
      min or max are exceeded.  Which means if a value largen that long int is
      written userspace will not receive an error instead the old value will be
      kept.  There is an argument to be made that this should be changed and
      __do_proc_doulongvec_minmax() should return an error when a dedicated min
      or max value are exceeded.  However this has the potential to break
      userspace so let's defer this to an RFC patch.
      
      Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.ioSigned-off-by: Christian Brauner's avatarChristian Brauner <christian@brauner.io>
      Acked-by: 's avatarKees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      [christian@brauner.io: v4]
        Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.ioSigned-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4e4d4979
    • Douglas Anderson's avatar
      tracing: kdb: Fix ftdump to not sleep · 043a4400
      Douglas Anderson authored
      [ Upstream commit 31b265b3 ]
      
      As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
      BUG for "sleeping function called from invalid context".
      
      kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
      atomic context.  A very simple solution for this is to add allocation
      flags to ring_buffer_read_prepare() so kdb can call it without
      triggering the allocation error.  This patch does that.
      
      Note that in the original email thread about this, it was suggested
      that perhaps the solution for kdb was to either preallocate the buffer
      ahead of time or create our own iterator.  I'm hoping that this
      alternative of adding allocation flags to ring_buffer_read_prepare()
      can be considered since it means I don't need to duplicate more of the
      core trace code into "trace_kdb.c" (for either creating my own
      iterator or re-preparing a ring allocator whose memory was already
      allocated).
      
      NOTE: another option for kdb is to actually figure out how to make it
      reuse the existing ftrace_dump() function and totally eliminate the
      duplication.  This sounds very appealing and actually works (the "sr
      z" command can be seen to properly dump the ftrace buffer).  The
      downside here is that ftrace_dump() fully consumes the trace buffer.
      Unless that is changed I'd rather not use it because it means "ftdump
      | grep xyz" won't be very useful to search the ftrace buffer since it
      will throw away the whole trace on the first grep.  A future patch to
      dump only the last few lines of the buffer will also be hard to
      implement.
      
      [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
      
      Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.orgReported-by: 's avatarBrian Norris <briannorris@chromium.org>
      Signed-off-by: 's avatarDouglas Anderson <dianders@chromium.org>
      Signed-off-by: 's avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      043a4400
  2. 03 Apr, 2019 4 commits
    • Xu Yu's avatar
      bpf: do not restore dst_reg when cur_state is freed · 046098f0
      Xu Yu authored
      commit 0803278b upstream.
      
      Syzkaller hit 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
      
      Call trace:
      
        dump_stack+0xbf/0x12e
        print_address_description+0x6a/0x280
        kasan_report+0x237/0x360
        sanitize_ptr_alu+0x85a/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fault injection trace:
      
        kfree+0xea/0x290
        free_func_state+0x4a/0x60
        free_verifier_state+0x61/0xe0
        push_stack+0x216/0x2f0	          <- inject failslab
        sanitize_ptr_alu+0x2b1/0x8d0
        adjust_ptr_min_max_vals+0x8f2/0x1ca0
        adjust_reg_min_max_vals+0x8ed/0x22e0
        do_check+0x1ca6/0x5d00
        bpf_check+0x9ca/0x2570
        bpf_prog_load+0xc91/0x1030
        __se_sys_bpf+0x61e/0x1f00
        do_syscall_64+0xc8/0x550
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When kzalloc() fails in push_stack(), free_verifier_state() will free
      current verifier state. As push_stack() returns, dst_reg was restored
      if ptr_is_dst_reg is false. However, as member of the cur_state,
      dst_reg is also freed, and error occurs when dereferencing dst_reg.
      Simply fix it by testing ret of push_stack() before restoring dst_reg.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Signed-off-by: 's avatarXu Yu <xuyu@linux.alibaba.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      046098f0
    • Thomas Gleixner's avatar
      cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n · c3bcf031
      Thomas Gleixner authored
      commit 206b9235 upstream.
      
      Tianyu reported a crash in a CPU hotplug teardown callback when booting a
      kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
      parameter.
      
      It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
      forever in case that a bringup callback fails. Unfortunately this issue was
      not recognized when the CPU hotplug code was reworked, so the shortcoming
      just stayed in place.
      
      When a bringup callback fails, the CPU hotplug code rolls back the
      operation and takes the CPU offline.
      
      The 'nosmt' command line argument uses a bringup failure to abort the
      bringup of SMT sibling CPUs. This partial bringup is required due to the
      MCE misdesign on Intel CPUs.
      
      With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
      CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
      teardown of a CPU including the synchronizations in various facilities like
      RCU, NOHZ and others.
      
      As a consequence the teardown callbacks which must be executed on the
      outgoing CPU within stop machine with interrupts disabled are executed on
      the control CPU in interrupt enabled and preemptible context causing the
      kernel to crash and burn. The pre state machine code has a different
      failure mode which is more subtle and resulting in a less obvious use after
      free crash because the control side frees resources which are still in use
      by the undead CPU.
      
      But this is not a x86 only problem. Any architecture which supports the
      SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
      likely to be triggered because in 99.99999% of the cases all bringup
      callbacks succeed.
      
      The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
      all architectures as the following architectures have either no hotplug
      support at all or not all subarchitectures support it:
      
       alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
      
      Crashing the kernel in such a situation is not an acceptable state
      either.
      
      Implement a minimal rollback variant by limiting the teardown to the point
      where all regular teardown callbacks have been invoked and leave the CPU in
      the 'dead' idle state. This has the following consequences:
      
       - the CPU is brought down to the point where the stop_machine takedown
         would happen.
      
       - the CPU stays there forever and is idle
      
       - The CPU is cleared in the CPU active mask, but not in the CPU online
         mask which is a legit state.
      
       - Interrupts are not forced away from the CPU
      
       - All facilities which only look at online mask would still see it, but
         that is the case during normal hotplug/unplug operations as well. It's
         just a (way) longer time frame.
      
      This will expose issues, which haven't been exposed before or only seldom,
      because now the normally transient state of being non active but online is
      a permanent state. In testing this exposed already an issue vs. work queues
      where the vmstat code schedules work on the almost dead CPU which ends up
      in an unbound workqueue and triggers 'preemtible context' warnings. This is
      not a problem of this change, it merily exposes an already existing issue.
      Still this is better than crashing fully without a chance to debug it.
      
      This is mainly thought as workaround for those architectures which do not
      support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
      
      Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
      Reported-by: 's avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: 's avatarTianyu Lan <Tianyu.Lan@microsoft.com>
      Acked-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Konrad Wilk <konrad.wilk@oracle.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.deSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c3bcf031
    • Thomas Gleixner's avatar
      watchdog: Respect watchdog cpumask on CPU hotplug · 53464ca9
      Thomas Gleixner authored
      commit 7dd47617 upstream.
      
      The rework of the watchdog core to use cpu_stop_work broke the watchdog
      cpumask on CPU hotplug.
      
      The watchdog_enable/disable() functions are now called unconditionally from
      the hotplug callback, i.e. even on CPUs which are not in the watchdog
      cpumask. As a consequence the watchdog can become unstoppable.
      
      Only invoke them when the plugged CPU is in the watchdog cpumask.
      
      Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
      Reported-by: Maxime Coquelin's avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: Maxime Coquelin's avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.deSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53464ca9
    • Frank Rowand's avatar
      tracing: initialize variable in create_dyn_event() · 8bf47766
      Frank Rowand authored
      commit 3dee10da upstream.
      
      Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized
      in this function [-Wuninitialized].
      
      Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: 5448d44c ("tracing: Add unified dynamic event framework")
      Signed-off-by: 's avatarFrank Rowand <frank.rowand@sony.com>
      Signed-off-by: 's avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8bf47766
  3. 27 Mar, 2019 2 commits
  4. 23 Mar, 2019 9 commits
  5. 10 Mar, 2019 1 commit
  6. 02 Mar, 2019 1 commit
  7. 01 Mar, 2019 1 commit
  8. 26 Feb, 2019 1 commit
  9. 22 Feb, 2019 1 commit
    • Alban Crequy's avatar
      bpf, lpm: fix lookup bug in map_delete_elem · 7c0cdf0b
      Alban Crequy authored
      trie_delete_elem() was deleting an entry even though it was not matching
      if the prefixlen was correct. This patch adds a check on matchlen.
      
      Reproducer:
      
      $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
      $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      key: 10 00 00 00 aa bb cc dd  value: 01
      Found 1 element
      $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
      $ echo $?
      0
      $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
      Found 0 elements
      
      A similar reproducer is added in the selftests.
      
      Without the patch:
      
      $ sudo ./tools/testing/selftests/bpf/test_lpm_map
      test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
      Aborted
      
      With the patch: test_lpm_map runs without errors.
      
      Fixes: e454cf59 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
      Cc: Craig Gallek <kraig@google.com>
      Signed-off-by: 's avatarAlban Crequy <alban@kinvolk.io>
      Acked-by: 's avatarCraig Gallek <kraig@google.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      7c0cdf0b
  10. 21 Feb, 2019 1 commit
    • Johannes Weiner's avatar
      psi: avoid divide-by-zero crash inside virtual machines · 4e37504d
      Johannes Weiner authored
      We've been seeing hard-to-trigger psi crashes when running inside VM
      instances:
      
          divide error: 0000 [#1] SMP PTI
          Modules linked in: [...]
          CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: events psi_clock
          RIP: 0010:psi_update_stats+0x270/0x490
          RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
          RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
          RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
          R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
          FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           psi_clock+0x12/0x50
           process_one_work+0x1e0/0x390
           worker_thread+0x2b/0x3c0
           ? rescuer_thread+0x330/0x330
           kthread+0x113/0x130
           ? kthread_create_worker_on_cpu+0x40/0x40
           ? SyS_exit_group+0x10/0x10
           ret_from_fork+0x35/0x40
          Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
      
      The Code-line points to `period` being 0 inside update_stats(), and we
      divide by that when calculating that period's pressure percentage.
      
      The elapsed period should never be 0.  The reason this can happen is due
      to an off-by-one in the idle time / missing period calculation combined
      with a coarse sched_clock() in the virtual machine.
      
      The target time for aggregation is advanced into the future on a fixed
      grid to prevent clock drift.  So when an aggregation runs after some idle
      period, we can not just set it to "now + psi_period", but have to
      calculate the downtime and advance the target time relative to itself.
      
      However, if the aggregator was disabled exactly one psi_period (ns), we
      drop one idle period in the calculation due to a > when we should do >=.
      In that case, next_update will be advanced from 'now - psi_period' to
      'now' when it should be moved to 'now + psi_period'.  The run finishes
      with last_update == next_update == sched_clock().
      
      With hardware clocks, this exact nanosecond match isn't likely in the
      first place; but if it does happen, the clock will still have moved on and
      the period non-zero by the time the worker runs.  A pointlessly short
      period, but besides the extra work, no harm no foul.  However, a slow
      sched_clock() like we have on VMs might not have advanced either by the
      time the worker runs again.  And when we calculate the elapsed period, the
      result, our pressure divisor, will be 0.  Ouch.
      
      Fix this by correctly handling the situation when the elapsed time between
      aggregation runs is precisely two periods, and advance the expiration
      timestamp correctly to period into the future.
      
      Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.orgSigned-off-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: Łukasz Siudut <lsiudut@fb.com
      Reviewed-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e37504d
  11. 15 Feb, 2019 2 commits
    • Quentin Perret's avatar
      tracing: Fix number of entries in trace header · 9e738215
      Quentin Perret authored
      The following commit
      
        441dae8f ("tracing: Add support for display of tgid in trace output")
      
      removed the call to print_event_info() from print_func_help_header_irq()
      which results in the ftrace header not reporting the number of entries
      written in the buffer. As this wasn't the original intent of the patch,
      re-introduce the call to print_event_info() to restore the orginal
      behaviour.
      
      Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.comAcked-by: 's avatarJoel Fernandes <joelaf@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: 's avatarQuentin Perret <quentin.perret@arm.com>
      Signed-off-by: 's avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9e738215
    • Du, Changbin's avatar
      kprobe: Do not use uaccess functions to access kernel memory that can fault · 2c4f1fcb
      Du, Changbin authored
      The userspace can ask kprobe to intercept strings at any memory address,
      including invalid kernel address. In this case, fetch_store_strlen()
      would crash since it uses general usercopy function, and user access
      functions are no longer allowed to access kernel memory.
      
      For example, we can crash the kernel by doing something as below:
      
      $ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'
      
      [  103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
      [  103.622104] general protection fault: 0000 [#1] SMP PTI
      [  103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba11-dirty #96
      [  103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
      [  103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
      [  103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63
       ca 89 f8 <8a> 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
      [  103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
      [  103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
      [  103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
      [  103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [  103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      [  103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  103.644040] FS:  0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
      [  103.646019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
      [  103.649147] Call Trace:
      [  103.649781]  ? sched_clock_cpu+0xc/0xa0
      [  103.650747]  ? do_sys_open+0x5/0x220
      [  103.651635]  kprobe_trace_func+0x303/0x380
      [  103.652645]  ? do_sys_open+0x5/0x220
      [  103.653528]  kprobe_dispatcher+0x45/0x50
      [  103.654682]  ? do_sys_open+0x1/0x220
      [  103.655875]  kprobe_ftrace_handler+0x90/0xf0
      [  103.657282]  ftrace_ops_assist_func+0x54/0xf0
      [  103.658564]  ? __call_rcu+0x1dc/0x280
      [  103.659482]  0xffffffffc00000bf
      [  103.660384]  ? __ia32_sys_open+0x20/0x20
      [  103.661682]  ? do_sys_open+0x1/0x220
      [  103.662863]  do_sys_open+0x5/0x220
      [  103.663988]  do_syscall_64+0x60/0x210
      [  103.665201]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  103.666862] RIP: 0033:0x7fc22fadccdd
      [  103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff
       ff 0f 05 <48> 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
      [  103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
      [  103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
      [  103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
      [  103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
      [  103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
      [  103.688056] Modules linked in:
      [  103.689131] ---[ end trace 43792035c28984a1 ]---
      
      This can be fixed by using probe_mem_read() instead, as it can handle faulting
      kernel memory addresses, which kprobes can legitimately do.
      
      Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com
      
      Cc: stable@vger.kernel.org
      Fixes: 9da3f2b7 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses")
      Signed-off-by: Du, Changbin's avatarChangbin Du <changbin.du@gmail.com>
      Signed-off-by: 's avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      2c4f1fcb
  12. 13 Feb, 2019 2 commits
    • Eric W. Biederman's avatar
      signal: Restore the stop PTRACE_EVENT_EXIT · cf43a757
      Eric W. Biederman authored
      In the middle of do_exit() there is there is a call
      "ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
      in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
      for the debugger to release the task or SIGKILL to be delivered.
      
      Skipping past dequeue_signal when we know a fatal signal has already
      been delivered resulted in SIGKILL remaining pending and
      TIF_SIGPENDING remaining set.  This in turn caused the
      scheduler to not sleep in PTACE_EVENT_EXIT as it figured
      a fatal signal was pending.  This also caused ptrace_freeze_traced
      in ptrace_check_attach to fail because it left a per thread
      SIGKILL pending which is what fatal_signal_pending tests for.
      
      This difference in signal state caused strace to report
      strace: Exit of unknown pid NNNNN ignored
      
      Therefore update the signal handling state like dequeue_signal
      would when removing a per thread SIGKILL, by removing SIGKILL
      from the per thread signal mask and clearing TIF_SIGPENDING.
      Acked-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: 's avatarIvan Delalande <colona@arista.com>
      Cc: stable@vger.kernel.org
      Fixes: 35634ffa ("signal: Always notice exiting tasks")
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      cf43a757
    • Ingo Molnar's avatar
      perf/core: Fix impossible ring-buffer sizes warning · 528871b4
      Ingo Molnar authored
      The following commit:
      
        9dff0aa9 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      
      results in perf recording failures with larger mmap areas:
      
        root@skl:/tmp# perf record -g -a
        failed to mmap with 12 (Cannot allocate memory)
      
      The root cause is that the following condition is buggy:
      
      	if (order_base_2(size) >= MAX_ORDER)
      		goto fail;
      
      The problem is that @size is in bytes and MAX_ORDER is in pages,
      so the right test is:
      
      	if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
      		goto fail;
      
      Fix it.
      Reported-by: 's avatar"Jin, Yao" <yao.jin@linux.intel.com>
      Bisected-by: b0R1sLAV's avatarBorislav Petkov <bp@alien8.de>
      Analyzed-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Julien Thierry <julien.thierry@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 9dff0aa9 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
      Signed-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      528871b4
  13. 11 Feb, 2019 2 commits
    • Andreas Ziegler's avatar
      tracing: probeevent: Correctly update remaining space in dynamic area · f6675872
      Andreas Ziegler authored
      Commit 9178412d ("tracing: probeevent: Return consumed
      bytes of dynamic area") improved the string fetching
      mechanism by returning the number of required bytes after
      copying the argument to the dynamic area. However, this
      return value is now only used to increment the pointer
      inside the dynamic area but misses updating the 'maxlen'
      variable which indicates the remaining space in the dynamic
      area.
      
      This means that fetch_store_string() always reads the *total*
      size of the dynamic area from the data_loc pointer instead of
      the *remaining* size (and passes it along to
      strncpy_from_{user,unsafe}) even if we're already about to
      copy data into the middle of the dynamic area.
      
      Link: http://lkml.kernel.org/r/20190206190013.16405-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 9178412d ("tracing: probeevent: Return consumed bytes of dynamic area")
      Acked-by: 's avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: 's avatarAndreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: 's avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      f6675872
    • Alexei Starovoitov's avatar
      bpf: fix lockdep false positive in stackmap · 3defaf2f
      Alexei Starovoitov authored
      Lockdep warns about false positive:
      [   11.211460] ------------[ cut here ]------------
      [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
      [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
      [   11.213134] Modules linked in:
      [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
      [   11.223508] Call Trace:
      [   11.223705]  <IRQ>
      [   11.223874]  ? __local_bh_enable+0x7a/0x80
      [   11.224199]  up_read+0x1c/0xa0
      [   11.224446]  do_up_read+0x12/0x20
      [   11.224713]  irq_work_run_list+0x43/0x70
      [   11.225030]  irq_work_run+0x26/0x50
      [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
      [   11.225662]  irq_work_interrupt+0xf/0x20
      
      since rw_semaphore is released in a different task vs task that locked the sema.
      It is expected behavior.
      Fix the warning with up_read_non_owner() and rwsem_release() annotation.
      
      Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context")
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      3defaf2f