1. 31 Jan, 2019 10 commits
    • Daniel Borkmann's avatar
      bpf: fix sanitation of alu op with pointer / scalar type from different paths · 4bce22c3
      Daniel Borkmann authored
      [ commit d3bd7413 upstream ]
      While 979d63d5 ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Daniel Borkmann's avatar
      bpf: prevent out of bounds speculation on pointer arithmetic · 078da99d
      Daniel Borkmann authored
      [ commit 979d63d5 upstream ]
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
        PTR += 0x1000 (e.g. K-based imm)
        PTR += 0x1000
      ... which under speculation could end up as ...
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
        # bpftool prog dump xlated id 282
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
      JIT blinding example with non-conflicting use of r10:
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Daniel Borkmann's avatar
      bpf: enable access to ax register also from verifier rewrite · 74d3c044
      Daniel Borkmann authored
      [ commit 9b73bfdd upstream ]
      Right now we are using BPF ax register in JIT for constant blinding as
      well as in interpreter as temporary variable. Verifier will not be able
      to use it simply because its use will get overridden from the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      will be skipped if there is prior use in either source or destination
      register on the instruction. Taking constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints. Note,
      ax register already has mappings in every eBPF JIT.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Daniel Borkmann's avatar
      bpf: move tmp variable into ax register in interpreter · 433303ac
      Daniel Borkmann authored
      [ commit 144cd91c upstream ]
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Daniel Borkmann's avatar
      bpf: move {prev_,}insn_idx into verifier env · 629b8af1
      Daniel Borkmann authored
      [ commit c08435ec upstream ]
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Deepa Dinamani's avatar
      Input: input_event - provide override for sparc64 · 16669bbb
      Deepa Dinamani authored
      commit 2e746942 upstream.
      The usec part of the timeval is defined as
      __kernel_suseconds_t	tv_usec; /* microseconds */
      Arnd noticed that sparc64 is the only architecture that defines
      __kernel_suseconds_t as int rather than long.
      This breaks the current y2038 fix for kernel as we only access and define
      the timeval struct for non-kernel use cases.  But, this was hidden by an
      another typo in the use of __KERNEL__ qualifier.
      Fix the typo, and provide an override for sparc64.
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Reported-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Dexuan Cui's avatar
      Drivers: hv: vmbus: Check for ring when getting debug info · 7e66208d
      Dexuan Cui authored
      commit ba50bf1c upstream.
      fc96df16 is good and can already fix the "return stack garbage" issue,
      but let's also improve hv_ringbuffer_get_debuginfo(), which would silently
      return stack garbage, if people forget to check channel->state or
      ring_info->ring_buffer, when using the function in the future.
      Having an error check in the function would eliminate the potential risk.
      Add a Fixes tag to indicate the patch depdendency.
      Fixes: fc96df16 ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels")
      Cc: stable@vger.kernel.org
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: default avatarStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: default avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ido Schimmel's avatar
      net: ipv4: Fix memory leak in network namespace dismantle · c05bd0be
      Ido Schimmel authored
      [ Upstream commit f97f4dd8 ]
      IPv4 routing tables are flushed in two cases:
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      Therefore, when a network namespace is being dismantled such routes are
      not flushed and leaked [1].
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable
      # ip netns del blue
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Camelia Groza's avatar
      net: phy: phy driver features are mandatory · 850d8483
      Camelia Groza authored
      [ Upstream commit 3e64cf7a ]
      Since phy driver features became a link_mode bitmap, phy drivers that
      don't have a list of features configured will cause the kernel to crash
      when probed.
      Prevent the phy driver from registering if the features field is missing.
      Fixes: 719655a1 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
      Reported-by: default avatarScott Wood <oss@buserror.net>
      Signed-off-by: default avatarCamelia Groza <camelia.groza@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ross Lagerwall's avatar
      net: Fix usage of pskb_trim_rcsum · 38c00a18
      Ross Lagerwall authored
      [ Upstream commit 6c57f045 ]
      In certain cases, pskb_trim_rcsum() may change skb pointers.
      Reinitialize header pointers afterwards to avoid potential
      use-after-frees. Add a note in the documentation of
      pskb_trim_rcsum(). Found by KASAN.
      Signed-off-by: default avatarRoss Lagerwall <ross.lagerwall@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 26 Jan, 2019 7 commits
    • Qian Cai's avatar
      mm/memblock.c: skip kmemleak for kasan_init() · 051b746c
      Qian Cai authored
      [ Upstream commit fed84c78 ]
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer went from something like 280 to 260000 which caused kmemleak
      disabled and crash dump memory reservation failed.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls came from kasan_init() given that those
      early KASAN memory mappings should not reference to other memory.  Hence,
      no kmemleak false positives.
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: default avatarQian Cai <cai@gmx.us>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Aaron Lu's avatar
      mm/swap: use nr_node_ids for avail_lists in swap_info_struct · c60ecfce
      Aaron Lu authored
      [ Upstream commit 66f71da9 ]
      Since a2468cc9 ("swap: choose swap device according to numa node"),
      avail_lists field of swap_info_struct is changed to an array with
      MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
      and needs an order-4 page to hold it.
      This is not optimal in that:
      1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
        is a waste of memory;
      2 It could cause swapon failure if the swap device is swapped on
        after system has been running for a while, due to no order-4
        page is available as pointed out by Vasily Averin.
      Solve the above two issues by using nr_node_ids(which is the actual
      possible node number the running system has) for avail_lists instead of
      nr_node_ids is unknown at compile time so can't be directly used when
      declaring this array.  What I did here is to declare avail_lists as zero
      element array and allocate space for it when allocating space for
      swap_info_struct.  The reason why keep using array but not pointer is
      plist_for_each_entry needs the field to be part of the struct, so pointer
      will not work.
      This patch is on top of Vasily Averin's fix commit.  I think the use of
      kvzalloc for swap_info_struct is still needed in case nr_node_ids is
      really big on some systems.
      Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.comSigned-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Bart Van Assche's avatar
      scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough · 6c05aea6
      Bart Van Assche authored
      [ Upstream commit ad669505 ]
      A session must only be released after all code that accesses the session
      structure has finished. Make sure that this is the case by introducing a
      new command counter per session that is only decremented after the
      .release_cmd() callback has finished. This patch fixes the following crash:
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
      Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
      CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
      Call Trace:
      srpt_set_ch_state+0x27/0x70 [ib_srpt]
      srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
      srpt_close_session+0xa8/0x260 [ib_srpt]
      target_shutdown_sessions+0x170/0x180 [target_core_mod]
      core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
      target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
      config_item_release+0x9c/0x110 [configfs]
      config_item_put+0x26/0x30 [configfs]
      configfs_rmdir+0x3b8/0x510 [configfs]
      Cc: Nicholas Bellinger <nab@linux-iscsi.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: Martin K. Petersen's avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Andrey Ignatov's avatar
      bpf: Allow narrow loads with offset > 0 · 29a28ec5
      Andrey Ignatov authored
      [ Upstream commit 46f53a65 ]
      Currently BPF verifier allows narrow loads for a context field only with
      offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      On the other hand LLVM can generate a load with offset different than
      zero that make sense from program logic point of view, but verifier
      doesn't accept it.
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
        #define DST_IP4			0xC0A801FEU /* */
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      where ctx is struct bpf_sock_addr.
      Some versions of LLVM can produce the following byte code for it:
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
      and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
      rejected by verifier.
      Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
      what means any is_valid_access implementation, that uses the function,
      works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
      sock_addr_is_valid_access() for bpf_sock_addr.
      The patch makes such loads supported. Offset can be in [0; size_default)
      but has to be multiple of load size. E.g. for __u32 field the following
      loads are supported now:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
      Reported-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Anders Roxell's avatar
      writeback: don't decrement wb->refcnt if !wb->bdi · 8a74670d
      Anders Roxell authored
      [ Upstream commit 347a28b5 ]
      This happened while running in qemu-system-aarch64, the AMBA PL011 UART
      driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
      devtmpfs' handle_remove() crashes because the reference count is a NULL
      pointer only because wb->bdi hasn't been initialized yet.
      Rework so that wb_put have an extra check if wb->bdi before decrement
      wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again
      in other drivers.
      Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
      Co-developed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Badhri Jagan Sridharan's avatar
      usb: typec: tcpm: Do not disconnect link for self powered devices · 5fca7488
      Badhri Jagan Sridharan authored
      [ Upstream commit 23b5f732 ]
      During HARD_RESET the data link is disconnected.
      For self powered device, the spec is advising against doing that.
      >From USB_PD_R3_0
      7.1.5 Response to Hard Resets
      Device operation during and after a Hard Reset is defined as follows:
      Self-powered devices Should Not disconnect from USB during a Hard Reset
      (see Section 9.1.2).
      Bus powered devices will disconnect from USB during a Hard Reset due to the
      loss of their power source.
      Tackle this by letting TCPM know whether the device is self or bus powered.
      This overcomes unnecessary port disconnections from hard reset.
      Also, speeds up the enumeration time when connected to Type-A ports.
      Signed-off-by: default avatarBadhri Jagan Sridharan <badhri@google.com>
      Reviewed-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Version history:
      Rebase on top of usb-next
      Based on feedback from heikki.krogerus@linux.intel.com
      - self_powered added to the struct tcpm_port which is populated from
        a. "connector" node of the device tree in tcpm_fw_get_caps()
        b. "self_powered" node of the tcpc_config in tcpm_copy_caps
      Based on feedbase from linux@roeck-us.net
      - Code was refactored
      - SRC_HARD_RESET_VBUS_OFF sets the link state to false based
        on self_powered flag
      V1 located here:
      https://lkml.org/lkml/2018/9/13/94Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
    • Arnd Bergmann's avatar
      ASoC: wm97xx: fix uninitialized regmap pointer problem · 47b6173b
      Arnd Bergmann authored
      [ Upstream commit 576ce407 ]
      gcc notices that without either the ac97 bus or the pdata, we never
      initialize the regmap pointer, which leads to an uninitialized variable
      sound/soc/codecs/wm9712.c: In function 'wm9712_soc_probe':
      sound/soc/codecs/wm9712.c:666:2: error: 'regmap' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      Since that configuration is invalid, it's better to return an error
      here. I tried to avoid adding complexity to the conditions, and turned
      the #ifdef into a regular if(IS_ENABLED()) check for readability.
      This in turn requires moving some header file declarations out of
      an #ifdef.
      The same code is used in three drivers, all of which I'm changing
      the same way.
      Fixes: 2ed1a8e0 ("ASoC: wm9712: add ac97 new bus support")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
  3. 22 Jan, 2019 7 commits
    • Yufen Yu's avatar
      block: use rcu_work instead of call_rcu to avoid sleep in softirq · f3631a8b
      Yufen Yu authored
      commit 94a2c3a3 upstream.
      We recently got a stack by syzkaller like this:
      BUG: sleeping function called from invalid context at mm/slab.h:361
      in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
      INFO: lockdep is turned off.
      CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
       0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
       0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
       0000000000000000 0000000000000001 0000000000000004 0000000000000000
      Call Trace:
       <IRQ>  [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
       <IRQ>  [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
       [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
       [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
       [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
       [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
       [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
       [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
       [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
       [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
       [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
       [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
       [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
       [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
       [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
       [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
       [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
       [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
       [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
       [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
       [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
       [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
       [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
       [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
       [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
       [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
       [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
       [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
       [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
       [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
       <EOI>  [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
       [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
       [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
       [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
       [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
       [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
      In softirq context, we call rcu callback function delete_partition_rcu_cb(),
      which may allocate memory by kzalloc with GFP_KERNEL flag. If the
      allocation cannot be satisfied, it may sleep. However, That is not allowed
      in softirq contex.
      Although we found this problem on linux 4.4, the latest kernel version
      seems to have this problem as well. And it is very similar to the
      previous one:
      Fix it by using RCU workqueue, which allows sleep.
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Adit Ranadive's avatar
      RDMA/vmw_pvrdma: Return the correct opcode when creating WR · 6c6f0b13
      Adit Ranadive authored
      commit 6325e01b upstream.
      Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
      opcodes explicitly.
      Reported-by: default avatarRuishuang Wang <ruishuangw@vmware.com>
      Fixes: 9a59739b ("IB/rxe: Revise the ib_wr_opcode enum")
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarBryan Tan <bryantan@vmware.com>
      Reviewed-by: default avatarRuishuang Wang <ruishuangw@vmware.com>
      Reviewed-by: default avatarVishnu Dasa <vdasa@vmware.com>
      Signed-off-by: default avatarAdit Ranadive <aditr@vmware.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Rafał Miłecki's avatar
      MIPS: BCM47XX: Setup struct device for the SoC · 23b14e74
      Rafał Miłecki authored
      commit 321c46b9 upstream.
      So far we never had any device registered for the SoC. This resulted in
      some small issues that we kept ignoring like:
      1) Not working GPIOLIB_IRQCHIP (gpiochip_irqchip_add_key() failing)
      2) Lack of proper tree in the /sys/devices/
      3) mips_dma_alloc_coherent() silently handling empty coherent_dma_mask
      Kernel 4.19 came with a lot of DMA changes and caused a regression on
      bcm47xx. Starting with the commit f8c55dc6 ("MIPS: use generic dma
      noncoherent ops for simple noncoherent platforms") DMA coherent
      allocations just fail. Example:
      [    1.114914] bgmac_bcma bcma0:2: Allocation of TX ring 0x200 failed
      [    1.121215] bgmac_bcma bcma0:2: Unable to alloc memory for DMA
      [    1.127626] bgmac_bcma: probe of bcma0:2 failed with error -12
      [    1.133838] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
      The bgmac driver also triggers a WARNING:
      [    0.959486] ------------[ cut here ]------------
      [    0.964387] WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 bgmac_enet_probe+0x1b4/0x5c4
      [    0.973751] Modules linked in:
      [    0.976913] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.9 #0
      [    0.982750] Stack : 804a0000 804597c4 00000000 00000000 80458fd8 8381bc2c 838282d4 80481a47
      [    0.991367]         8042e3ec 00000001 804d38f0 00000204 83980000 00000065 8381bbe0 6f55b24f
      [    0.999975]         00000000 00000000 80520000 00002018 00000000 00000075 00000007 00000000
      [    1.008583]         00000000 80480000 000ee811 00000000 00000000 00000000 80432c00 80248db8
      [    1.017196]         00000009 00000204 83980000 803ad7b0 00000000 801feeec 00000000 804d0000
      [    1.025804]         ...
      [    1.028325] Call Trace:
      [    1.030875] [<8000aef8>] show_stack+0x58/0x100
      [    1.035513] [<8001f8b4>] __warn+0xe4/0x118
      [    1.039708] [<8001f9a4>] warn_slowpath_null+0x48/0x64
      [    1.044935] [<80248db8>] bgmac_enet_probe+0x1b4/0x5c4
      [    1.050101] [<802498e0>] bgmac_probe+0x558/0x590
      [    1.054906] [<80252fd0>] bcma_device_probe+0x38/0x70
      [    1.060017] [<8020e1e8>] really_probe+0x170/0x2e8
      [    1.064891] [<8020e714>] __driver_attach+0xa4/0xec
      [    1.069784] [<8020c1e0>] bus_for_each_dev+0x58/0xb0
      [    1.074833] [<8020d590>] bus_add_driver+0xf8/0x218
      [    1.079731] [<8020ef24>] driver_register+0xcc/0x11c
      [    1.084804] [<804b54cc>] bgmac_init+0x1c/0x44
      [    1.089258] [<8000121c>] do_one_initcall+0x7c/0x1a0
      [    1.094343] [<804a1d34>] kernel_init_freeable+0x150/0x218
      [    1.099886] [<803a082c>] kernel_init+0x10/0x104
      [    1.104583] [<80005878>] ret_from_kernel_thread+0x14/0x1c
      [    1.110107] ---[ end trace f441c0d873d1fb5b ]---
      This patch setups a "struct device" (and passes it to the bcma) which
      allows fixing all the mentioned problems. It'll also require a tiny bcma
      patch which will follow through the wireless tree & its maintainer.
      Fixes: f8c55dc6 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms")
      Signed-off-by: default avatarRafał Miłecki <rafal@milecki.pl>
      Signed-off-by: default avatarPaul Burton <paul.burton@mips.com>
      Acked-by: Hauke Mehrtens's avatarHauke Mehrtens <hauke@hauke-m.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: linux-wireless@vger.kernel.org
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Greg Kroah-Hartman's avatar
      IN_BADCLASS: fix macro to actually work · a6ab2ac9
      Greg Kroah-Hartman authored
      [ Upstream commit f275ee0f ]
      Commit 65cab850 ("net: Allow class-e address assignment via ifconfig
      ioctl") modified the IN_BADCLASS macro a bit, but unfortunatly one too
      many '(' characters were added to the line, making any code that used
      it, not build properly.
      Also, the macro now compares an unsigned with a signed value, which
      isn't ok, so fix that up by making both types match properly.
      Reported-by: default avatarChristopher Ferris <cferris@google.com>
      Fixes: 65cab850 ("net: Allow class-e address assignment via ifconfig ioctl")
      Cc: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Andrew Lunn's avatar
      net: phy: Add missing features to PHY drivers · 702ee831
      Andrew Lunn authored
      [ Upstream commit 9e857a40 ]
      The bcm87xx and micrel driver has PHYs which are missing the .features
      value. Add them. The bcm87xx is a 10G FEC only PHY. Add the needed
      features definition of this PHY.
      Fixes: 719655a1 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
      Reported-by: default avatarScott Wood <oss@buserror.net>
      Reported-by: default avatarCamelia Groza <camelia.groza@nxp.com>
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pablo Neira Ayuso's avatar
      netfilter: nf_conncount: speculative garbage collection on empty lists · 385c1e4b
      Pablo Neira Ayuso authored
      commit c80f10bc upstream.
      Instead of removing a empty list node that might be reintroduced soon
      thereafter, tentatively place the empty list node on the list passed to
      tree_nodes_free(), then re-check if the list is empty again before erasing
      it from the tree.
      [ Florian: rebase on top of pending nf_conncount fixes ]
      Fixes: 5c789e13 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
      Reviewed-by: default avatarShawn Bohrer <sbohrer@cloudflare.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Florian Westphal's avatar
      netfilter: nf_conncount: merge lookup and add functions · 3409dd1d
      Florian Westphal authored
      commit df4a9025 upstream.
      'lookup' is always followed by 'add'.
      Merge both and make the list-walk part of nf_conncount_add().
      This also avoids one unneeded unlock/re-lock pair.
      Extra care needs to be taken in count_tree, as we only hold rcu
      read lock, i.e. we can only insert to an existing tree node after
      acquiring its lock and making sure it has a nonzero count.
      As a zero count should be rare, just fall back to insert_tree()
      (which acquires tree lock).
      This issue and its solution were pointed out by Shawn Bohrer
      during patch review.
      Reviewed-by: default avatarShawn Bohrer <sbohrer@cloudflare.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 16 Jan, 2019 3 commits
    • Vasily Averin's avatar
      sunrpc: use-after-free in svc_process_common() · 696d76cc
      Vasily Averin authored
      commit d4b09acf upstream.
      if node have NFSv41+ mounts inside several net namespaces
      it can lead to use-after-free in svc_process_common()
              /* Setup reply header */
              rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
      svc_process_common() can use incorrect rqstp->rq_xprt,
      its caller function bc_svc_process() takes it from serv->sv_bc_xprt.
      The problem is that serv is global structure but sv_bc_xprt
      is assigned per-netnamespace.
      According to Trond, the whole "let's set up rqstp->rq_xprt
      for the back channel" is nothing but a giant hack in order
      to work around the fact that svc_process_common() uses it
      to find the xpt_ops, and perform a couple of (meaningless
      for the back channel) tests of xpt_flags.
      All we really need in svc_process_common() is to be able to run
      Bruce J Fields points that this xpo_prep_reply_hdr() call
      is an awfully roundabout way just to do "svc_putnl(resv, 0);"
      in the tcp case.
      This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(),
      now it calls svc_process_common() with rqstp->rq_xprt = NULL.
      To adjust reply header svc_process_common() just check
      rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case.
      To handle rqstp->rq_xprt = NULL case in functions called from
      svc_process_common() patch intruduces net namespace pointer
      svc_rqst->rq_bc_net and adjust SVC_NET() definition.
      Some other function was also adopted to properly handle described case.
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Cc: stable@vger.kernel.org
      Fixes: 23c20ecd ("NFS: callback up - users counting cleanup")
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      v2: added lost extern svc_tcp_prep_reply_hdr()
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • WANG Chao's avatar
      x86, modpost: Replace last remnants of RETPOLINE with CONFIG_RETPOLINE · d7aff5e5
      WANG Chao authored
      commit e4f35891 upstream.
        4cd24de3 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
      replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the
      remaining pieces.
       [ bp: Massage commit message. ]
      Fixes: 4cd24de3 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
      Signed-off-by: default avatarWANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Reviewed-by: default avatarZhenzhong Duan <zhenzhong.duan@oracle.com>
      Reviewed-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Cc: Michal Marek <michal.lkml@markovi.net>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: linux-kbuild@vger.kernel.org
      Cc: srinivas.eeda@oracle.com
      Cc: stable <stable@vger.kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cnSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Viresh Kumar's avatar
      cpufreq: scpi/scmi: Fix freeing of dynamic OPPs · 94fca4b4
      Viresh Kumar authored
      commit 1690d8bb upstream.
      Since the commit 2a4eb735 "OPP: Don't remove dynamic OPPs from
      _dev_pm_opp_remove_table()", dynamically created OPP aren't
      automatically removed anymore by dev_pm_opp_cpumask_remove_table(). This
      affects the scpi and scmi cpufreq drivers which no longer free OPPs on
      failures or on invocations of the policy->exit() callback.
      Create a generic OPP helper dev_pm_opp_remove_all_dynamic() which can be
      called from these drivers instead of dev_pm_opp_cpumask_remove_table().
      In dev_pm_opp_remove_all_dynamic(), we need to make sure that the
      opp_list isn't getting accessed simultaneously from other parts of the
      OPP core while the helper is freeing dynamic OPPs, i.e. we can't drop
      the opp_table->lock while traversing through the OPP list. And to
      accomplish that, this patch also creates _opp_kref_release_unlocked()
      which can be called from this new helper with the opp_table lock already
      Cc: 4.20 <stable@vger.kernel.org> # v4.20
      Reported-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Fixes: 2a4eb735 "OPP: Don't remove dynamic OPPs from _dev_pm_opp_remove_table()"
      Signed-off-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Tested-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: default avatarSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 13 Jan, 2019 2 commits
  6. 09 Jan, 2019 7 commits
    • Hans Verkuil's avatar
      media: cec: keep track of outstanding transmits · b13f60e7
      Hans Verkuil authored
      commit 32804fcb upstream.
      I noticed that repeatedly running 'cec-ctl --playback' would occasionally
      select 'Playback Device 2' instead of 'Playback Device 1', even though there
      were no other Playback devices in the HDMI topology. This happened both with
      'real' hardware and with the vivid CEC emulation, suggesting that this was an
      issue in the core code that claims a logical address.
      What 'cec-ctl --playback' does is to first clear all existing logical addresses,
      and immediately after that configure the new desired device type.
      The core code will poll the logical addresses trying to find a free address.
      When found it will issue a few standard messages as per the CEC spec and return.
      Those messages are queued up and will be transmitted asynchronously.
      What happens is that if you run two 'cec-ctl --playback' commands in quick
      succession, there is still a message of the first cec-ctl command being transmitted
      when you reconfigure the adapter again in the second cec-ctl command.
      When the logical addresses are cleared, then all information about outstanding
      transmits inside the CEC core is also cleared, and the core is no longer aware
      that there is still a transmit in flight.
      When the hardware finishes the transmit it calls transmit_done and the CEC core
      thinks it is actually in response of a POLL messages that is trying to find a
      free logical address. The result of all this is that the core thinks that the
      logical address for Playback Device 1 is in use, when it is really an earlier
      transmit that ended.
      The main transmit thread looks at adap->transmitting to check if a transmit
      is in progress, but that is set to NULL when the adapter is unconfigured.
      adap->transmitting represents the view of userspace, not that of the hardware.
      So when unconfiguring the adapter the message is marked aborted from the point
      of view of userspace, but seen from the PoV of the hardware it is still ongoing.
      So introduce a new bool transmit_in_progress that represents the hardware state
      and use that instead of adap->transmitting. Now the CEC core waits until the
      hardware finishes the transmit before starting a new transmit.
      Signed-off-by: default avatarHans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: <stable@vger.kernel.org>      # for v4.18 and up
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Todd Kjos's avatar
      binder: fix use-after-free due to ksys_close() during fdget() · 27564d8d
      Todd Kjos authored
      commit 80cd7956 upstream.
      44d8047f ("binder: use standard functions to allocate fds")
      exposed a pre-existing issue in the binder driver.
      fdget() is used in ksys_ioctl() as a performance optimization.
      One of the rules associated with fdget() is that ksys_close() must
      not be called between the fdget() and the fdput(). There is a case
      where this requirement is not met in the binder driver which results
      in the reference count dropping to 0 when the device is still in
      use. This can result in use-after-free or other issues.
      If userpace has passed a file-descriptor for the binder driver using
      a BINDER_TYPE_FDA object, then kys_close() is called on it when
      handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
      the assumptions for using fdget().
      The problem is fixed by deferring the close using task_work_add(). A
      new variant of __close_fd() was created that returns a struct file
      with a reference. The fput() is deferred instead of using ksys_close().
      Fixes: 44d8047f ("binder: use standard functions to allocate fds")
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarTodd Kjos <tkjos@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Theodore Ts'o's avatar
      ext4: force inode writes when nfsd calls commit_metadata() · ec9639da
      Theodore Ts'o authored
      commit fde87268 upstream.
      Some time back, nfsd switched from calling vfs_fsync() to using a new
      commit_metadata() hook in export_operations().  If the file system did
      not provide a commit_metadata() hook, it fell back to using
      sync_inode_metadata().  Unfortunately doesn't work on all file
      systems.  In particular, it doesn't work on ext4 due to how the inode
      gets journalled --- the VFS writeback code will not always call
      So we need to provide our own ext4_nfs_commit_metdata() method which
      calls ext4_write_inode() directly.
      Google-Bug-Id: 121195940
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Miquel Raynal's avatar
      platform-msi: Free descriptors in platform_msi_domain_free() · 38958724
      Miquel Raynal authored
      commit 81b1e6e6 upstream.
      Since the addition of platform MSI support, there were two helpers
      supposed to allocate/free IRQs for a device:
      In these helpers, IRQ descriptors are allocated in the "alloc" routine
      while they are freed in the "free" one.
      Later, two other helpers have been added to handle IRQ domains on top
      of MSI domains:
      Seen from the outside, the logic is pretty close with the former
      helpers and people used it with the same logic as before: a
      platform_msi_domain_alloc() call should be balanced with a
      platform_msi_domain_free() call. While this is probably what was
      intended to do, the platform_msi_domain_free() does not remove/free
      the IRQ descriptor(s) created/inserted in
      One effect of such situation is that removing a module that requested
      an IRQ will let one orphaned IRQ descriptor (with an allocated MSI
      entry) in the device descriptors list. Next time the module will be
      inserted back, one will observe that the allocation will happen twice
      in the MSI domain, one time for the remaining descriptor, one time for
      the new one. It also has the side effect to quickly overshoot the
      maximum number of allocated MSI and then prevent any module requesting
      an interrupt in the same domain to be inserted anymore.
      This situation has been met with loops of insertion/removal of the
      mvpp2.ko module (requesting 15 MSIs each time).
      Fixes: 552c494a ("platform-msi: Allow creation of a MSI-based stacked irq domain")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMiquel Raynal <miquel.raynal@bootlin.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Deepa Dinamani's avatar
      sock: Make sock->sk_stamp thread-safe · a912b531
      Deepa Dinamani authored
      [ Upstream commit 3a0ed3e9 ]
      Al Viro mentioned (Message-ID
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Cong Wang's avatar
      ptr_ring: wrap back ->producer in __ptr_ring_swap_queue() · 618cdf94
      Cong Wang authored
      [ Upstream commit aff6db45 ]
      __ptr_ring_swap_queue() tries to move pointers from the old
      ring to the new one, but it forgets to check if ->producer
      is beyond the new size at the end of the operation. This leads
      to an out-of-bound access in __ptr_ring_produce() as reported
      by syzbot.
      Reported-by: syzbot+8993c0fa96d57c399735@syzkaller.appspotmail.com
      Fixes: 5d49de53 ("ptr_ring: resize support")
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Willem de Bruijn's avatar
      ip: validate header length on virtual device xmit · f48945b6
      Willem de Bruijn authored
      [ Upstream commit cb9f1b78 ]
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 22 Dec, 2018 1 commit
  8. 19 Dec, 2018 3 commits