1. 20 Feb, 2019 2 commits
    • Zachary Hays's avatar
      mmc: block: handle complete_work on separate workqueue · 1fdf7b8b
      Zachary Hays authored
      commit dcf6e2e3 upstream.
      
      The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
      This generates a rescuer thread for that queue that will trigger when
      the CPU is under heavy load and collect the uncompleted work.
      
      In the case of mmc, this creates the possibility of a deadlock when
      there are multiple partitions on the device as other blk-mq work is
      also run on the same queue. For example:
      
      - worker 0 claims the mmc host to work on partition 1
      - worker 1 attempts to claim the host for partition 2 but has to wait
        for worker 0 to finish
      - worker 0 schedules complete_work to release the host
      - rescuer thread is triggered after time-out and collects the dangling
        work
      - rescuer thread attempts to complete the work in order starting with
        claim host
      - the task to release host is now blocked by a task to claim it and
        will never be called
      
      The above results in multiple hung tasks that lead to failures to
      mount partitions.
      
      Handling complete_work on a separate workqueue avoids this by keeping
      the work completion tasks separate from the other blk-mq work. This
      allows the host to be released without getting blocked by other tasks
      attempting to claim the host.
      Signed-off-by: 's avatarZachary Hays <zhays@lexmark.com>
      Fixes: 81196976 ("mmc: block: Add blk-mq support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fdf7b8b
    • Jiri Olsa's avatar
      perf/x86: Add check_period PMU callback · 6a66c2d0
      Jiri Olsa authored
      commit 81ec3f3c upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and it into a BTS event it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for intel_pmu_drain_bts_buffer() caller.
      
      Adding a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Plus adding also the limit_period
      check as well.
      Reported-by: 's avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: 's avatarJiri Olsa <jolsa@kernel.org>
      Acked-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@kravaSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a66c2d0
  2. 12 Feb, 2019 10 commits
    • Josh Poimboeuf's avatar
      cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · f29a8be0
      Josh Poimboeuf authored
      commit b284909a upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: 's avatarIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: 's avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.comSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      f29a8be0
    • Vladis Dronov's avatar
      HID: debug: fix the ring buffer implementation · a8d5fb2f
      Vladis Dronov authored
      commit 13054abb upstream.
      
      Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
      is strange allowing lost or corrupted data. After commit 717adfda
      ("HID: debug: check length before copy_to_user()") it is possible to enter
      an infinite loop in hid_debug_events_read() by providing 0 as count, this
      locks up a system. Fix this by rewriting the ring buffer implementation
      with kfifo and simplify the code.
      
      This fixes CVE-2019-3819.
      
      v2: fix an execution logic and add a comment
      v3: use __set_current_state() instead of set_current_state()
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
      Cc: stable@vger.kernel.org # v4.18+
      Fixes: cd667ce2 ("HID: use debugfs for events/reports dumping")
      Fixes: 717adfda ("HID: debug: check length before copy_to_user()")
      Signed-off-by: 's avatarVladis Dronov <vdronov@redhat.com>
      Reviewed-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: 's avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8d5fb2f
    • Takashi Iwai's avatar
      ALSA: hda - Serialize codec registrations · 970ad7a2
      Takashi Iwai authored
      commit 305a0ade upstream.
      
      In the current code, the codec registration may happen both at the
      codec bind time and the end of the controller probe time.  In a rare
      occasion, they race with each other, leading to Oops due to the still
      uninitialized card device.
      
      This patch introduces a simple flag to prevent the codec registration
      at the codec bind time as long as the controller probe is going on.
      The controller probe invokes snd_card_register() that does the whole
      registration task, and we don't need to register each piece
      beforehand.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      970ad7a2
    • Charles Keepax's avatar
      ALSA: compress: Fix stop handling on compressed capture streams · 18a47b48
      Charles Keepax authored
      commit 4f2ab5e1 upstream.
      
      It is normal user behaviour to start, stop, then start a stream
      again without closing it. Currently this works for compressed
      playback streams but not capture ones.
      
      The states on a compressed capture stream go directly from OPEN to
      PREPARED, unlike a playback stream which moves to SETUP and waits
      for a write of data before moving to PREPARED. Currently however,
      when a stop is sent the state is set to SETUP for both types of
      streams. This leaves a capture stream in the situation where a new
      start can't be sent as that requires the state to be PREPARED and
      a new set_params can't be sent as that requires the state to be
      OPEN. The only option being to close the stream, and then reopen.
      
      Correct this issues by allowing snd_compr_drain_notify to set the
      state depending on the stream direction, as we already do in
      set_params.
      
      Fixes: 49bb6402 ("ALSA: compress_core: Add support for capture streams")
      Signed-off-by: 's avatarCharles Keepax <ckeepax@opensource.cirrus.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18a47b48
    • Jim Mattson's avatar
      kvm: Change offset in kvm_write_guest_offset_cached to unsigned · 0635be29
      Jim Mattson authored
      [ Upstream commit 7a86dab8 ]
      
      Since the offset is added directly to the hva from the
      gfn_to_hva_cache, a negative offset could result in an out of bounds
      write. The existing BUG_ON only checks for addresses beyond the end of
      the gfn_to_hva_cache, not for addresses before the start of the
      gfn_to_hva_cache.
      
      Note that all current call sites have non-negative offsets.
      
      Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
      Reported-by: 's avatarCfir Cohen <cfir@google.com>
      Signed-off-by: 's avatarJim Mattson <jmattson@google.com>
      Reviewed-by: 's avatarCfir Cohen <cfir@google.com>
      Reviewed-by: 's avatarPeter Shier <pshier@google.com>
      Reviewed-by: 's avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: 's avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: 's avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      0635be29
    • John Fastabend's avatar
      bpf: sk_msg, fix socket data_ready events · 47c12a8a
      John Fastabend authored
      [ Upstream commit 552de910 ]
      
      When a skb verdict program is in-use and either another BPF program
      redirects to that socket or the new SK_PASS support is used the
      data_ready callback does not wake up application. Instead because
      the stream parser/verdict is using the sk data_ready callback we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and if so call the saved pointer which is the
      upper layers wake up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: 's avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      47c12a8a
    • Nathan Chancellor's avatar
      drbd: Avoid Clang warning about pointless switch statment · bee009d6
      Nathan Chancellor authored
      [ Upstream commit a52c5a16 ]
      
      There are several warnings from Clang about no case statement matching
      the constant 0:
      
      In file included from drivers/block/drbd/drbd_receiver.c:48:
      In file included from drivers/block/drbd/drbd_int.h:48:
      In file included from ./include/linux/drbd_genl_api.h:54:
      In file included from ./include/linux/genl_magic_struct.h:236:
      ./include/linux/drbd_genl.h:321:1: warning: no case matching constant
      switch condition '0'
      GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
      'GENL_struct'
              switch (0) {
                      ^
      
      Silence this warning by adding a 'case 0:' statement. Additionally,
      adjust the alignment of the statements in the ct_assert_unique macro to
      avoid a checkpatch warning.
      
      This solution was originally sent by Arnd Bergmann with a default case
      statement: https://lore.kernel.org/patchwork/patch/756723/
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/43Suggested-by: 's avatarLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Nathan Chancellor's avatarNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      bee009d6
    • Parav Pandit's avatar
      RDMA/core: Sync unregistration with netlink commands · f9acb020
      Parav Pandit authored
      [ Upstream commit 01b67117 ]
      
      When the rdma device is getting removed, get resource info can race with
      device removal, as below:
      
            CPU-0                                  CPU-1
          --------                               --------
          rdma_nl_rcv_msg()
             nldev_res_get_cq_dumpit()
                mutex_lock(device_lock);
                get device reference
                mutex_unlock(device_lock);        [..]
                                                  ib_unregister_device()
                                                  /* Valid reference to
                                                   * device->dev exists.
                                                   */
                                                   ib_dealloc_device()
      
                [..]
                provider->fill_res_entry();
      
      Even though device object is not freed, fill_res_entry() can get called on
      device which doesn't have a driver anymore. Kernel core device reference
      count is not sufficient, as this only keeps the structure valid, and
      doesn't guarantee the driver is still loaded.
      
      Similar race can occur with device renaming and device removal, where
      device_rename() tries to rename a unregistered device. While this is fine
      for devices of a class which are not net namespace aware, but it is
      incorrect for net namespace aware class coming in subsequent series.  If a
      class is net namespace aware, then the below [1] call trace is observed in
      above situation.
      
      Therefore, to avoid the race, keep a reference count and let device
      unregistration wait until all netlink users drop the reference.
      
      [1] Call trace:
      kernfs: ns required in 'infiniband' for 'mlx5_0'
      WARNING: CPU: 18 PID: 44270 at fs/kernfs/dir.c:842 kernfs_find_ns+0x104/0x120
      libahci i2c_core mlxfw libata dca [last unloaded: devlink]
      RIP: 0010:kernfs_find_ns+0x104/0x120
      Call Trace:
      kernfs_find_and_get_ns+0x2e/0x50
      sysfs_rename_link_ns+0x40/0xb0
      device_rename+0xb2/0xf0
      ib_device_rename+0xb3/0x100 [ib_core]
      nldev_set_doit+0x165/0x190 [ib_core]
      rdma_nl_rcv_msg+0x249/0x250 [ib_core]
      ? netlink_deliver_tap+0x8f/0x3e0
      rdma_nl_rcv+0xd6/0x120 [ib_core]
      netlink_unicast+0x17c/0x230
      netlink_sendmsg+0x2f0/0x3e0
      sock_sendmsg+0x30/0x40
      __sys_sendto+0xdc/0x160
      
      Fixes: da5c8507 ("RDMA/nldev: add driver-specific resource tracking")
      Signed-off-by: 's avatarParav Pandit <parav@mellanox.com>
      Signed-off-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: 's avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      f9acb020
    • Saeed Mahameed's avatar
      net/mlx5: EQ, Use the right place to store/read IRQ affinity hint · 57da3b1f
      Saeed Mahameed authored
      [ Upstream commit 1e86ace4 ]
      
      Currently the cpu affinity hint mask for completion EQs is stored and
      read from the wrong place, since reading and storing is done from the
      same index, there is no actual issue with that, but internal irq_info
      for completion EQs stars at MLX5_EQ_VEC_COMP_BASE offset in irq_info
      array, this patch changes the code to use the correct offset to store
      and read the IRQ affinity hint.
      Signed-off-by: 's avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Reviewed-by: 's avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      57da3b1f
    • Muchun Song's avatar
      gpiolib: Fix possible use after free on label · 4672971c
      Muchun Song authored
      [ Upstream commit 18534df4 ]
      
      gpiod_request_commit() copies the pointer to the label passed as
      an argument only to be used later. But there's a chance the caller
      could immediately free the passed string(e.g., local variable).
      This could trigger a use after free when we use gpio label(e.g.,
      gpiochip_unlock_as_irq(), gpiochip_is_requested()).
      
      To be on the safe side: duplicate the string with kstrdup_const()
      so that if an unaware user passes an address to a stack-allocated
      buffer, we won't get the arbitrary label.
      
      Also fix gpiod_set_consumer_name().
      Signed-off-by: 's avatarMuchun Song <smuchun@gmail.com>
      Signed-off-by: 's avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4672971c
  3. 06 Feb, 2019 4 commits
    • Frank Rowand's avatar
      of: overlay: add tests to validate kfrees from overlay removal · 0aa65adc
      Frank Rowand authored
      commit 144552c7 upstream.
      
      Add checks:
        - attempted kfree due to refcount reaching zero before overlay
          is removed
        - properties linked to an overlay node when the node is removed
        - node refcount > one during node removal in a changeset destroy,
          if the node was created by the changeset
      
      After applying this patch, several validation warnings will be
      reported from the devicetree unittest during boot due to
      pre-existing devicetree bugs. The warnings will be similar to:
      
        OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11
        OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/
        hvac-medium-2
      Tested-by: 's avatarAlan Tull <atull@kernel.org>
      Signed-off-by: 's avatarFrank Rowand <frank.rowand@sony.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0aa65adc
    • Tetsuo Handa's avatar
      oom, oom_reaper: do not enqueue same task twice · 9f1a96b1
      Tetsuo Handa authored
      commit 9bcdeb51 upstream.
      
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg despite the
      number of tasks in that memcg is not 0.  It turned out that there is a
      bug in wake_oom_reaper() which allows enqueuing same task twice which
      makes impossible to decrease the number of tasks in that memcg due to a
      refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @p1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect of this patch, this patch also avoids enqueuing multiple
      threads sharing memory via task_will_free_mem(current) path.
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: arekm's avatarArkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: arekm's avatarArkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f1a96b1
    • Dave Watson's avatar
      net: tls: Save iv in tls_rec for async crypto requests · cb77e08d
      Dave Watson authored
      [ Upstream commit 32eb67b9 ]
      
      aead_request_set_crypt takes an iv pointer, and we change the iv
      soon after setting it.  Some async crypto algorithms don't save the iv,
      so we need to save it in the tls_rec for async requests.
      
      Found by hardcoding x64 aesni to use async crypto manager (to test the async
      codepath), however I don't think this combination can happen in the wild.
      Presumably other hardware offloads will need this fix, but there have been
      no user reports.
      
      Fixes: a42055e8 ("Add support for async encryption of records...")
      Signed-off-by: 's avatarDave Watson <davejwatson@fb.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb77e08d
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · 61d228f9
      Daniel Borkmann authored
      [ Upstream commit d5256083 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: 's avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61d228f9
  4. 31 Jan, 2019 12 commits
    • Deepa Dinamani's avatar
      Input: input_event - fix the CONFIG_SPARC64 mixup · 4c8e5815
      Deepa Dinamani authored
      commit 141e5dca upstream.
      
      Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
      Override with an equivalent conditional.
      
      Fixes: 2e746942 ("Input: input_event - provide override for sparc64")
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Signed-off-by: 's avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: 's avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c8e5815
    • Dexuan Cui's avatar
      Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel() · e4266dff
      Dexuan Cui authored
      [ Upstream commit 4d3c5c69 ]
      
      Commit d86adf48 ("scsi: storvsc: Enable multi-queue support") removed
      the usage of the API in Jan 2017, and the API is not used since then.
      
      netvsc and storvsc have their own algorithms to determine the outgoing
      channel, so this API is useless.
      
      And the API is potentially unsafe, because it reads primary->num_sc without
      any lock held. This can be risky considering the RESCIND-OFFER message.
      
      Let's remove the API.
      
      Cc: Long Li <longli@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      e4266dff
    • Daniel Borkmann's avatar
      bpf: fix sanitation of alu op with pointer / scalar type from different paths · 4bce22c3
      Daniel Borkmann authored
      [ commit d3bd7413 upstream ]
      
      While 979d63d5 ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: 's avatarJann Horn <jannh@google.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4bce22c3
    • Daniel Borkmann's avatar
      bpf: prevent out of bounds speculation on pointer arithmetic · 078da99d
      Daniel Borkmann authored
      [ commit 979d63d5 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: 's avatarJann Horn <jannh@google.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      078da99d
    • Daniel Borkmann's avatar
      bpf: enable access to ax register also from verifier rewrite · 74d3c044
      Daniel Borkmann authored
      [ commit 9b73bfdd upstream ]
      
      Right now we are using BPF ax register in JIT for constant blinding as
      well as in interpreter as temporary variable. Verifier will not be able
      to use it simply because its use will get overridden from the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      will be skipped if there is prior use in either source or destination
      register on the instruction. Taking constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints. Note,
      ax register already has mappings in every eBPF JIT.
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      74d3c044
    • Daniel Borkmann's avatar
      bpf: move tmp variable into ax register in interpreter · 433303ac
      Daniel Borkmann authored
      [ commit 144cd91c upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      433303ac
    • Daniel Borkmann's avatar
      bpf: move {prev_,}insn_idx into verifier env · 629b8af1
      Daniel Borkmann authored
      [ commit c08435ec upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      629b8af1
    • Deepa Dinamani's avatar
      Input: input_event - provide override for sparc64 · 16669bbb
      Deepa Dinamani authored
      commit 2e746942 upstream.
      
      The usec part of the timeval is defined as
      __kernel_suseconds_t	tv_usec; /* microseconds */
      
      Arnd noticed that sparc64 is the only architecture that defines
      __kernel_suseconds_t as int rather than long.
      
      This breaks the current y2038 fix for kernel as we only access and define
      the timeval struct for non-kernel use cases.  But, this was hidden by an
      another typo in the use of __KERNEL__ qualifier.
      
      Fix the typo, and provide an override for sparc64.
      
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Reported-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16669bbb
    • Dexuan Cui's avatar
      Drivers: hv: vmbus: Check for ring when getting debug info · 7e66208d
      Dexuan Cui authored
      commit ba50bf1c upstream.
      
      fc96df16 is good and can already fix the "return stack garbage" issue,
      but let's also improve hv_ringbuffer_get_debuginfo(), which would silently
      return stack garbage, if people forget to check channel->state or
      ring_info->ring_buffer, when using the function in the future.
      
      Having an error check in the function would eliminate the potential risk.
      
      Add a Fixes tag to indicate the patch depdendency.
      
      Fixes: fc96df16 ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels")
      Cc: stable@vger.kernel.org
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: 's avatarStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e66208d
    • Ido Schimmel's avatar
      net: ipv4: Fix memory leak in network namespace dismantle · c05bd0be
      Ido Schimmel authored
      [ Upstream commit f97f4dd8 ]
      
      IPv4 routing tables are flushed in two cases:
      
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      
      Therefore, when a network namespace is being dismantled such routes are
      not flushed and leaked [1].
      
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable 192.0.2.0/24
      # ip netns del blue
      
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
      
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      
      [1]
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
        backtrace:
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
        backtrace:
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: 's avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: 's avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c05bd0be
    • Camelia Groza's avatar
      net: phy: phy driver features are mandatory · 850d8483
      Camelia Groza authored
      [ Upstream commit 3e64cf7a ]
      
      Since phy driver features became a link_mode bitmap, phy drivers that
      don't have a list of features configured will cause the kernel to crash
      when probed.
      
      Prevent the phy driver from registering if the features field is missing.
      
      Fixes: 719655a1 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
      Reported-by: 's avatarScott Wood <oss@buserror.net>
      Signed-off-by: 's avatarCamelia Groza <camelia.groza@nxp.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      850d8483
    • Ross Lagerwall's avatar
      net: Fix usage of pskb_trim_rcsum · 38c00a18
      Ross Lagerwall authored
      [ Upstream commit 6c57f045 ]
      
      In certain cases, pskb_trim_rcsum() may change skb pointers.
      Reinitialize header pointers afterwards to avoid potential
      use-after-frees. Add a note in the documentation of
      pskb_trim_rcsum(). Found by KASAN.
      Signed-off-by: 's avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      38c00a18
  5. 26 Jan, 2019 7 commits
    • Qian Cai's avatar
      mm/memblock.c: skip kmemleak for kasan_init() · 051b746c
      Qian Cai authored
      [ Upstream commit fed84c78 ]
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer went from something like 280 to 260000 which caused kmemleak
      disabled and crash dump memory reservation failed.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls came from kasan_init() given that those
      early KASAN memory mappings should not reference to other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: 's avatarQian Cai <cai@gmx.us>
      Acked-by: 's avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      051b746c
    • Aaron Lu's avatar
      mm/swap: use nr_node_ids for avail_lists in swap_info_struct · c60ecfce
      Aaron Lu authored
      [ Upstream commit 66f71da9 ]
      
      Since a2468cc9 ("swap: choose swap device according to numa node"),
      avail_lists field of swap_info_struct is changed to an array with
      MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
      and needs an order-4 page to hold it.
      
      This is not optimal in that:
      1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
        is a waste of memory;
      2 It could cause swapon failure if the swap device is swapped on
        after system has been running for a while, due to no order-4
        page is available as pointed out by Vasily Averin.
      
      Solve the above two issues by using nr_node_ids(which is the actual
      possible node number the running system has) for avail_lists instead of
      MAX_NUMNODES.
      
      nr_node_ids is unknown at compile time so can't be directly used when
      declaring this array.  What I did here is to declare avail_lists as zero
      element array and allocate space for it when allocating space for
      swap_info_struct.  The reason why keep using array but not pointer is
      plist_for_each_entry needs the field to be part of the struct, so pointer
      will not work.
      
      This patch is on top of Vasily Averin's fix commit.  I think the use of
      kvzalloc for swap_info_struct is still needed in case nr_node_ids is
      really big on some systems.
      
      Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.comSigned-off-by: 's avatarAaron Lu <aaron.lu@intel.com>
      Reviewed-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      c60ecfce
    • Bart Van Assche's avatar
      scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough · 6c05aea6
      Bart Van Assche authored
      [ Upstream commit ad669505 ]
      
      A session must only be released after all code that accesses the session
      structure has finished. Make sure that this is the case by introducing a
      new command counter per session that is only decremented after the
      .release_cmd() callback has finished. This patch fixes the following crash:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
      Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
      CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
      Call Trace:
      dump_stack+0xa4/0xf5
      print_address_description+0x6f/0x270
      kasan_report+0x241/0x360
      __asan_load4+0x78/0x80
      do_raw_spin_lock+0x1c/0x130
      _raw_spin_lock_irqsave+0x52/0x60
      srpt_set_ch_state+0x27/0x70 [ib_srpt]
      srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
      srpt_close_session+0xa8/0x260 [ib_srpt]
      target_shutdown_sessions+0x170/0x180 [target_core_mod]
      core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
      target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
      config_item_release+0x9c/0x110 [configfs]
      config_item_put+0x26/0x30 [configfs]
      configfs_rmdir+0x3b8/0x510 [configfs]
      vfs_rmdir+0xb3/0x1e0
      do_rmdir+0x262/0x2c0
      do_syscall_64+0x77/0x230
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Cc: Nicholas Bellinger <nab@linux-iscsi.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: 's avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: Martin K. Petersen's avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      6c05aea6
    • Andrey Ignatov's avatar
      bpf: Allow narrow loads with offset > 0 · 29a28ec5
      Andrey Ignatov authored
      [ Upstream commit 46f53a65 ]
      
      Currently BPF verifier allows narrow loads for a context field only with
      offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      
      On the other hand LLVM can generate a load with offset different than
      zero that make sense from program logic point of view, but verifier
      doesn't accept it.
      
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
      
        #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
        ...
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      
      where ctx is struct bpf_sock_addr.
      
      Some versions of LLVM can produce the following byte code for it:
      
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      
      where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
      and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
      rejected by verifier.
      
      Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
      what means any is_valid_access implementation, that uses the function,
      works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
      sock_addr_is_valid_access() for bpf_sock_addr.
      
      The patch makes such loads supported. Offset can be in [0; size_default)
      but has to be multiple of load size. E.g. for __u32 field the following
      loads are supported now:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
      Reported-by: 's avatarYonghong Song <yhs@fb.com>
      Signed-off-by: 's avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      29a28ec5
    • Anders Roxell's avatar
      writeback: don't decrement wb->refcnt if !wb->bdi · 8a74670d
      Anders Roxell authored
      [ Upstream commit 347a28b5 ]
      
      This happened while running in qemu-system-aarch64, the AMBA PL011 UART
      driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
      devtmpfs' handle_remove() crashes because the reference count is a NULL
      pointer only because wb->bdi hasn't been initialized yet.
      
      Rework so that wb_put have an extra check if wb->bdi before decrement
      wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again
      in other drivers.
      
      Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
      Co-developed-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      8a74670d
    • Badhri Jagan Sridharan's avatar
      usb: typec: tcpm: Do not disconnect link for self powered devices · 5fca7488
      Badhri Jagan Sridharan authored
      [ Upstream commit 23b5f732 ]
      
      During HARD_RESET the data link is disconnected.
      For self powered device, the spec is advising against doing that.
      
      >From USB_PD_R3_0
      7.1.5 Response to Hard Resets
      Device operation during and after a Hard Reset is defined as follows:
      Self-powered devices Should Not disconnect from USB during a Hard Reset
      (see Section 9.1.2).
      Bus powered devices will disconnect from USB during a Hard Reset due to the
      loss of their power source.
      
      Tackle this by letting TCPM know whether the device is self or bus powered.
      
      This overcomes unnecessary port disconnections from hard reset.
      Also, speeds up the enumeration time when connected to Type-A ports.
      Signed-off-by: 's avatarBadhri Jagan Sridharan <badhri@google.com>
      Reviewed-by: 's avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      ---------
      Version history:
      V3:
      Rebase on top of usb-next
      
      V2:
      Based on feedback from heikki.krogerus@linux.intel.com
      - self_powered added to the struct tcpm_port which is populated from
        a. "connector" node of the device tree in tcpm_fw_get_caps()
        b. "self_powered" node of the tcpc_config in tcpm_copy_caps
      
      Based on feedbase from linux@roeck-us.net
      - Code was refactored
      - SRC_HARD_RESET_VBUS_OFF sets the link state to false based
        on self_powered flag
      
      V1 located here:
      https://lkml.org/lkml/2018/9/13/94Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      5fca7488
    • Arnd Bergmann's avatar
      ASoC: wm97xx: fix uninitialized regmap pointer problem · 47b6173b
      Arnd Bergmann authored
      [ Upstream commit 576ce407 ]
      
      gcc notices that without either the ac97 bus or the pdata, we never
      initialize the regmap pointer, which leads to an uninitialized variable
      access:
      
      sound/soc/codecs/wm9712.c: In function 'wm9712_soc_probe':
      sound/soc/codecs/wm9712.c:666:2: error: 'regmap' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      Since that configuration is invalid, it's better to return an error
      here. I tried to avoid adding complexity to the conditions, and turned
      the #ifdef into a regular if(IS_ENABLED()) check for readability.
      This in turn requires moving some header file declarations out of
      an #ifdef.
      
      The same code is used in three drivers, all of which I'm changing
      the same way.
      
      Fixes: 2ed1a8e0 ("ASoC: wm9712: add ac97 new bus support")
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarMark Brown <broonie@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      47b6173b
  6. 22 Jan, 2019 5 commits
    • Yufen Yu's avatar
      block: use rcu_work instead of call_rcu to avoid sleep in softirq · f3631a8b
      Yufen Yu authored
      commit 94a2c3a3 upstream.
      
      We recently got a stack by syzkaller like this:
      
      BUG: sleeping function called from invalid context at mm/slab.h:361
      in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
      INFO: lockdep is turned off.
      CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
       0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
       0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
       0000000000000000 0000000000000001 0000000000000004 0000000000000000
      Call Trace:
       <IRQ>  [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
       <IRQ>  [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
       [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
       [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
       [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
       [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
       [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
       [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
       [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
       [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
       [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
       [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
       [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
       [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
       [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
       [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
       [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
       [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
       [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
       [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
       [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
       [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
       [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
       [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
       [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
       [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
       [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
       [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
       [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
       [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
       <EOI>  [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
       [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
       [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
       [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
       [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
       [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
      
      In softirq context, we call rcu callback function delete_partition_rcu_cb(),
      which may allocate memory by kzalloc with GFP_KERNEL flag. If the
      allocation cannot be satisfied, it may sleep. However, That is not allowed
      in softirq contex.
      
      Although we found this problem on linux 4.4, the latest kernel version
      seems to have this problem as well. And it is very similar to the
      previous one:
      	https://lkml.org/lkml/2018/7/9/391
      
      Fix it by using RCU workqueue, which allows sleep.
      Reviewed-by: 's avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: 's avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f3631a8b
    • Adit Ranadive's avatar
      RDMA/vmw_pvrdma: Return the correct opcode when creating WR · 6c6f0b13
      Adit Ranadive authored
      commit 6325e01b upstream.
      
      Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
      opcodes explicitly.
      Reported-by: 's avatarRuishuang Wang <ruishuangw@vmware.com>
      Fixes: 9a59739b ("IB/rxe: Revise the ib_wr_opcode enum")
      Cc: stable@vger.kernel.org
      Reviewed-by: 's avatarBryan Tan <bryantan@vmware.com>
      Reviewed-by: 's avatarRuishuang Wang <ruishuangw@vmware.com>
      Reviewed-by: 's avatarVishnu Dasa <vdasa@vmware.com>
      Signed-off-by: 's avatarAdit Ranadive <aditr@vmware.com>
      Signed-off-by: 's avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6c6f0b13
    • Rafał Miłecki's avatar
      MIPS: BCM47XX: Setup struct device for the SoC · 23b14e74
      Rafał Miłecki authored
      commit 321c46b9 upstream.
      
      So far we never had any device registered for the SoC. This resulted in
      some small issues that we kept ignoring like:
      1) Not working GPIOLIB_IRQCHIP (gpiochip_irqchip_add_key() failing)
      2) Lack of proper tree in the /sys/devices/
      3) mips_dma_alloc_coherent() silently handling empty coherent_dma_mask
      
      Kernel 4.19 came with a lot of DMA changes and caused a regression on
      bcm47xx. Starting with the commit f8c55dc6 ("MIPS: use generic dma
      noncoherent ops for simple noncoherent platforms") DMA coherent
      allocations just fail. Example:
      [    1.114914] bgmac_bcma bcma0:2: Allocation of TX ring 0x200 failed
      [    1.121215] bgmac_bcma bcma0:2: Unable to alloc memory for DMA
      [    1.127626] bgmac_bcma: probe of bcma0:2 failed with error -12
      [    1.133838] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
      
      The bgmac driver also triggers a WARNING:
      [    0.959486] ------------[ cut here ]------------
      [    0.964387] WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 bgmac_enet_probe+0x1b4/0x5c4
      [    0.973751] Modules linked in:
      [    0.976913] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.9 #0
      [    0.982750] Stack : 804a0000 804597c4 00000000 00000000 80458fd8 8381bc2c 838282d4 80481a47
      [    0.991367]         8042e3ec 00000001 804d38f0 00000204 83980000 00000065 8381bbe0 6f55b24f
      [    0.999975]         00000000 00000000 80520000 00002018 00000000 00000075 00000007 00000000
      [    1.008583]         00000000 80480000 000ee811 00000000 00000000 00000000 80432c00 80248db8
      [    1.017196]         00000009 00000204 83980000 803ad7b0 00000000 801feeec 00000000 804d0000
      [    1.025804]         ...
      [    1.028325] Call Trace:
      [    1.030875] [<8000aef8>] show_stack+0x58/0x100
      [    1.035513] [<8001f8b4>] __warn+0xe4/0x118
      [    1.039708] [<8001f9a4>] warn_slowpath_null+0x48/0x64
      [    1.044935] [<80248db8>] bgmac_enet_probe+0x1b4/0x5c4
      [    1.050101] [<802498e0>] bgmac_probe+0x558/0x590
      [    1.054906] [<80252fd0>] bcma_device_probe+0x38/0x70
      [    1.060017] [<8020e1e8>] really_probe+0x170/0x2e8
      [    1.064891] [<8020e714>] __driver_attach+0xa4/0xec
      [    1.069784] [<8020c1e0>] bus_for_each_dev+0x58/0xb0
      [    1.074833] [<8020d590>] bus_add_driver+0xf8/0x218
      [    1.079731] [<8020ef24>] driver_register+0xcc/0x11c
      [    1.084804] [<804b54cc>] bgmac_init+0x1c/0x44
      [    1.089258] [<8000121c>] do_one_initcall+0x7c/0x1a0
      [    1.094343] [<804a1d34>] kernel_init_freeable+0x150/0x218
      [    1.099886] [<803a082c>] kernel_init+0x10/0x104
      [    1.104583] [<80005878>] ret_from_kernel_thread+0x14/0x1c
      [    1.110107] ---[ end trace f441c0d873d1fb5b ]---
      
      This patch setups a "struct device" (and passes it to the bcma) which
      allows fixing all the mentioned problems. It'll also require a tiny bcma
      patch which will follow through the wireless tree & its maintainer.
      
      Fixes: f8c55dc6 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms")
      Signed-off-by: 's avatarRafał Miłecki <rafal@milecki.pl>
      Signed-off-by: 's avatarPaul Burton <paul.burton@mips.com>
      Acked-by: Hauke Mehrtens's avatarHauke Mehrtens <hauke@hauke-m.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: linux-wireless@vger.kernel.org
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23b14e74
    • Greg Kroah-Hartman's avatar
      IN_BADCLASS: fix macro to actually work · a6ab2ac9
      Greg Kroah-Hartman authored
      [ Upstream commit f275ee0f ]
      
      Commit 65cab850 ("net: Allow class-e address assignment via ifconfig
      ioctl") modified the IN_BADCLASS macro a bit, but unfortunatly one too
      many '(' characters were added to the line, making any code that used
      it, not build properly.
      
      Also, the macro now compares an unsigned with a signed value, which
      isn't ok, so fix that up by making both types match properly.
      Reported-by: 's avatarChristopher Ferris <cferris@google.com>
      Fixes: 65cab850 ("net: Allow class-e address assignment via ifconfig ioctl")
      Cc: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6ab2ac9
    • Andrew Lunn's avatar
      net: phy: Add missing features to PHY drivers · 702ee831
      Andrew Lunn authored
      [ Upstream commit 9e857a40 ]
      
      The bcm87xx and micrel driver has PHYs which are missing the .features
      value. Add them. The bcm87xx is a 10G FEC only PHY. Add the needed
      features definition of this PHY.
      
      Fixes: 719655a1 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
      Reported-by: 's avatarScott Wood <oss@buserror.net>
      Reported-by: 's avatarCamelia Groza <camelia.groza@nxp.com>
      Signed-off-by: 's avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      702ee831