1. 27 Feb, 2019 9 commits
    • Willem de Bruijn's avatar
      net: avoid false positives in untrusted gso validation · 0dab7eb8
      Willem de Bruijn authored
      commit 9e8db591 upstream.
      
      GSO packets with vnet_hdr must conform to a small set of gso_types.
      The below commit uses flow dissection to drop packets that do not.
      
      But it has false positives when the skb is not fully initialized.
      Dissection needs skb->protocol and skb->network_header.
      
      Infer skb->protocol from gso_type as the two must agree.
      SKB_GSO_UDP can use both ipv4 and ipv6, so try both.
      
      Exclude callers for which network header offset is not known.
      
      Fixes: d5be7f63 ("net: validate untrusted gso packets without csum offload")
      Signed-off-by: 's avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0dab7eb8
    • Willem de Bruijn's avatar
      net: validate untrusted gso packets without csum offload · f709f7b2
      Willem de Bruijn authored
      commit d5be7f63 upstream.
      
      Syzkaller again found a path to a kernel crash through bad gso input.
      By building an excessively large packet to cause an skb field to wrap.
      
      If VIRTIO_NET_HDR_F_NEEDS_CSUM was set this would have been dropped in
      skb_partial_csum_set.
      
      GSO packets that do not set checksum offload are suspicious and rare.
      Most callers of virtio_net_hdr_to_skb already pass them to
      skb_probe_transport_header.
      
      Move that test forward, change it to detect parse failure and drop
      packets on failure as those cleary are not one of the legitimate
      VIRTIO_NET_HDR_GSO types.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Fixes: f43798c2 ("tun: Allow GSO using virtio_net_hdr")
      Reported-by: 's avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: 's avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: 's avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f709f7b2
    • Curtis Malainey's avatar
      ASoC: soc-core: fix init platform memory handling · df33b662
      Curtis Malainey authored
      commit 09ac6a81 upstream.
      
      snd_soc_init_platform initializes pointers to snd_soc_dai_link which is
      statically allocated and it does this by devm_kzalloc. In the event of
      an EPROBE_DEFER the memory will be freed and the pointers are left
      dangling. snd_soc_init_platform sees the dangling pointers and assumes
      they are pointing to initialized memory and does not reallocate them on
      the second probe attempt which results in a use after free bug since
      devm has freed the memory from the first probe attempt.
      
      Since the intention for snd_soc_dai_link->platform is that it can be set
      statically by the machine driver we need to respect the pointer in the
      event we did not set it but still catch dangling pointers. The solution
      is to add a flag to track whether the pointer was dynamically allocated
      or not.
      Signed-off-by: 's avatarCurtis Malainey <cujomalainey@chromium.org>
      Signed-off-by: 's avatarMark Brown <broonie@kernel.org>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      df33b662
    • Eric Biggers's avatar
      KEYS: user: Align the payload buffer · cfc7fbe6
      Eric Biggers authored
      commit cc1780fc upstream.
      
      Align the payload of "user" and "logon" keys so that users of the
      keyrings service can access it as a struct that requires more than
      2-byte alignment.  fscrypt currently does this which results in the read
      of fscrypt_key::size being misaligned as it needs 4-byte alignment.
      
      Align to __alignof__(u64) rather than __alignof__(long) since in the
      future it's conceivable that people would use structs beginning with
      u64, which on some platforms would require more than 'long' alignment.
      Reported-by: 's avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Fixes: 2aa349f6 ("[PATCH] Keys: Export user-defined keyring operations")
      Fixes: 88bd6ccd ("ext4 crypto: add encryption key management facilities")
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarEric Biggers <ebiggers@google.com>
      Tested-by: 's avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: 's avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: 's avatarJames Morris <james.morris@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cfc7fbe6
    • Konstantin Khlebnikov's avatar
      inet_diag: fix reporting cgroup classid and fallback to priority · 25b3f56a
      Konstantin Khlebnikov authored
      [ Upstream commit 1ec17dbd ]
      
      Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
      extensions has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them included into response
      unconditionally or hook into some of lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has not way to request from the beginning.
      
      This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents behavior for other extensions.
      
      Also this patch adds fallback to reporting socket priority. This filed
      is more widely used for traffic classification because ipv4 sockets
      automatically maps TOS to priority and default qdisc pfifo_fast knows
      about that. But priority could be changed via setsockopt SO_PRIORITY so
      INET_DIAG_TOS isn't enough for predicting class.
      
      Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
      reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for most common setup when net_cls isn't set and/or cgroup2 in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: 's avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25b3f56a
    • David Howells's avatar
      afs: Fix race in async call refcounting · 12a072a0
      David Howells authored
      [ Upstream commit 34fa4761 ]
      
      There's a race between afs_make_call() and afs_wake_up_async_call() in the
      case that an error is returned from rxrpc_kernel_send_data() after it has
      queued the final packet.
      
      afs_make_call() will try and clean up the mess, but the call state may have
      been moved on thereby causing afs_process_async_call() to also try and to
      delete the call.
      
      Fix this by:
      
       (1) Getting an extra ref for an asynchronous call for the call itself to
           hold.  This makes sure the call doesn't evaporate on us accidentally
           and will allow the call to be retained by the caller in a future
           patch.  The ref is released on leaving afs_make_call() or
           afs_wait_for_call_to_complete().
      
       (2) In the event of an error from rxrpc_kernel_send_data():
      
           (a) Don't set the call state to AFS_CALL_COMPLETE until *after* the
           	 call has been aborted and ended.  This prevents
           	 afs_deliver_to_call() from doing anything with any notifications
           	 it gets.
      
           (b) Explicitly end the call immediately to prevent further callbacks.
      
           (c) Cancel any queued async_work and wait for the work if it's
           	 executing.  This allows us to be sure the race won't recur when we
           	 change the state.  We put the work queue's ref on the call if we
           	 managed to cancel it.
      
           (d) Put the call's ref that we got in (1).  This belongs to us as long
           	 as the call is in state AFS_CALL_CL_REQUESTING.
      
      Fixes: 341f741f ("afs: Refcount the afs_call struct")
      Signed-off-by: 's avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      12a072a0
    • wenxu's avatar
      netfilter: nft_flow_offload: fix interaction with vrf slave device · 6b08e8a0
      wenxu authored
      [ Upstream commit 10f4e765 ]
      
      In the forward chain, the iif is changed from slave device to master vrf
      device. Thus, flow offload does not find a match on the lower slave
      device.
      
      This patch uses the cached route, ie. dst->dev, to update the iif and
      oif fields in the flow entry.
      
      After this patch, the following example works fine:
      
       # ip addr add dev eth0 1.1.1.1/24
       # ip addr add dev eth1 10.0.0.1/24
       # ip link add user1 type vrf table 1
       # ip l set user1 up
       # ip l set dev eth0 master user1
       # ip l set dev eth1 master user1
      
       # nft add table firewall
       # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
       # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
       # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
       # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1Signed-off-by: 's avatarwenxu <wenxu@ucloud.cn>
      Signed-off-by: 's avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      6b08e8a0
    • Michael S. Tsirkin's avatar
      include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR · d3082183
      Michael S. Tsirkin authored
      [ Upstream commit 3e2ffd65 ]
      
      Since commit 815f0ddb ("include/linux/compiler*.h: make compiler-*.h
      mutually exclusive") clang no longer reuses the OPTIMIZER_HIDE_VAR macro
      from compiler-gcc - instead it gets the version in
      include/linux/compiler.h.  Unfortunately that version doesn't actually
      prevent compiler from optimizing out the variable.
      
      Fix up by moving the macro out from compiler-gcc.h to compiler.h.
      Compilers without incline asm support will keep working
      since it's protected by an ifdef.
      
      Also fix up comments to match reality since we are no longer overriding
      any macros.
      
      Build-tested with gcc and clang.
      
      Fixes: 815f0ddb ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
      Cc: Eli Friedman <efriedma@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: 's avatarNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: 's avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Miguel Ojeda's avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      d3082183
    • Denis Bolotin's avatar
      qed: Fix qed_chain_set_prod() for PBL chains with non power of 2 page count · 92acaa68
      Denis Bolotin authored
      [ Upstream commit 2d533a92 ]
      
      In PBL chains with non power of 2 page count, the producer is not at the
      beginning of the chain when index is 0 after a wrap. Therefore, after the
      producer index wrap around, page index should be calculated more carefully.
      Signed-off-by: 's avatarDenis Bolotin <dbolotin@marvell.com>
      Signed-off-by: 's avatarAriel Elior <aelior@marvell.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      92acaa68
  2. 23 Feb, 2019 6 commits
    • Eric Dumazet's avatar
      ax25: fix possible use-after-free · 57863611
      Eric Dumazet authored
      commit 63530aba upstream.
      
      syzbot found that ax25 routes where not properly protected
      against concurrent use [1].
      
      In this particular report the bug happened while
      copying ax25->digipeat.
      
      Fix this problem by making sure we call ax25_get_route()
      while ax25_route_lock is held, so that no modification
      could happen while using the route.
      
      The current two ax25_get_route() callers do not sleep,
      so this change should be fine.
      
      Once we do that, ax25_get_route() no longer needs to
      grab a reference on the found route.
      
      [1]
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113
      Read of size 66 at addr ffff888066641a80 by task syz-executor2/531
      
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x24/0x50 mm/kasan/common.c:130
       memcpy include/linux/string.h:352 [inline]
       kmemdup+0x42/0x60 mm/util.c:113
       kmemdup include/linux/string.h:425 [inline]
       ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424
       ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224
       __sys_connect+0x357/0x490 net/socket.c:1664
       __do_sys_connect net/socket.c:1675 [inline]
       __se_sys_connect net/socket.c:1672 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1672
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458099
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099
      RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4
      R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff
      
      Allocated by task 526:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_kmalloc mm/kasan/common.c:496 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
       kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609
       kmalloc include/linux/slab.h:545 [inline]
       ax25_rt_add net/ax25/ax25_route.c:95 [inline]
       ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
      Freed by task 550:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
       __cache_free mm/slab.c:3487 [inline]
       kfree+0xcf/0x230 mm/slab.c:3806
       ax25_rt_add net/ax25/ax25_route.c:92 [inline]
       ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888066641a80
       which belongs to the cache kmalloc-96 of size 96
      The buggy address is located 0 bytes inside of
       96-byte region [ffff888066641a80, ffff888066641ae0)
      The buggy address belongs to the page:
      page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0
      flags: 0x1fffc0000000200(slab)
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0
      raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
      >ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                         ^
       ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
      Signed-off-by: 's avatarEric Dumazet <edumazet@google.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Reported-by: 's avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      57863611
    • Ard Biesheuvel's avatar
      efi/arm: Revert "Defer persistent reservations until after paging_init()" · da9365ad
      Ard Biesheuvel authored
      Commit 582a32e7 upstream.
      
      This reverts commit eff89628, which
      deferred the processing of persistent memory reservations to a point
      where the memory may have already been allocated and overwritten,
      defeating the purpose.
      Signed-off-by: 's avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: 's avatarWill Deacon <will.deacon@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190215123333.21209-3-ard.biesheuvel@linaro.orgSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      da9365ad
    • Ard Biesheuvel's avatar
      arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table · de3d833d
      Ard Biesheuvel authored
      Commit 8a5b403d upstream.
      
      In the irqchip and EFI code, we have what basically amounts to a quirk
      to work around a peculiarity in the GICv3 architecture, which permits
      the system memory address of LPI tables to be programmable only once
      after a CPU reset. This means kexec kernels must use the same memory
      as the first kernel, and thus ensure that this memory has not been
      given out for other purposes by the time the ITS init code runs, which
      is not very early for secondary CPUs.
      
      On systems with many CPUs, these reservations could overflow the
      memblock reservation table, and this was addressed in commit:
      
        eff89628 ("efi/arm: Defer persistent reservations until after paging_init()")
      
      However, this turns out to have made things worse, since the allocation
      of page tables and heap space for the resized memblock reservation table
      itself may overwrite the regions we are attempting to reserve, which may
      cause all kinds of corruption, also considering that the ITS will still
      be poking bits into that memory in response to incoming MSIs.
      
      So instead, let's grow the static memblock reservation table on such
      systems so it can accommodate these reservations at an earlier time.
      This will permit us to revert the above commit in a subsequent patch.
      
      [ mingo: Minor cleanups. ]
      Signed-off-by: 's avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: 's avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: 's avatarWill Deacon <will.deacon@arm.com>
      Acked-by: 's avatarMarc Zyngier <marc.zyngier@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.orgSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      [ ardb: Double the size of the slack to account for the lack of an
              optimization that was introduced in mainline after the release
              of v4.20. ]
      Signed-off-by: 's avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      de3d833d
    • David S. Miller's avatar
      net: Add header for usage of fls64() · 0009ef57
      David S. Miller authored
      [ Upstream commit 8681ef1f ]
      
      Fixes: 3b89ea9c ("net: Fix for_each_netdev_feature on Big endian")
      Suggested-by: 's avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      0009ef57
    • Hauke Mehrtens's avatar
      net: Fix for_each_netdev_feature on Big endian · 6d7a9a07
      Hauke Mehrtens authored
      [ Upstream commit 3b89ea9c ]
      
      The features attribute is of type u64 and stored in the native endianes on
      the system. The for_each_set_bit() macro takes a pointer to a 32 bit array
      and goes over the bits in this area. On little Endian systems this also
      works with an u64 as the most significant bit is on the highest address,
      but on big endian the words are swapped. When we expect bit 15 here we get
      bit 47 (15 + 32).
      
      This patch converts it more or less to its own for_each_set_bit()
      implementation which works on 64 bit integers directly. This is then
      completely in host endianness and should work like expected.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: 's avatarHauke Mehrtens <hauke.mehrtens@intel.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      6d7a9a07
    • Lorenzo Bianconi's avatar
      net: ipv4: use a dedicated counter for icmp_v4 redirect packets · 5d5c002b
      Lorenzo Bianconi authored
      [ Upstream commit c09551c6 ]
      
      According to the algorithm described in the comment block at the
      beginning of ip_rt_send_redirect, the host should try to send
      'ip_rt_redirect_number' ICMP redirect packets with an exponential
      backoff and then stop sending them at all assuming that the destination
      ignores redirects.
      If the device has previously sent some ICMP error packets that are
      rate-limited (e.g TTL expired) and continues to receive traffic,
      the redirect packets will never be transmitted. This happens since
      peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
      and so it will never be reset even if the redirect silence timeout
      (ip_rt_redirect_silence) has elapsed without receiving any packet
      requiring redirects.
      
      Fix it by using a dedicated counter for the number of ICMP redirect
      packets that has been sent by the host
      
      I have not been able to identify a given commit that introduced the
      issue since ip_rt_send_redirect implements the same rate-limiting
      algorithm from commit 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: 's avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      5d5c002b
  3. 20 Feb, 2019 2 commits
    • Zachary Hays's avatar
      mmc: block: handle complete_work on separate workqueue · 1fdf7b8b
      Zachary Hays authored
      commit dcf6e2e3 upstream.
      
      The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
      This generates a rescuer thread for that queue that will trigger when
      the CPU is under heavy load and collect the uncompleted work.
      
      In the case of mmc, this creates the possibility of a deadlock when
      there are multiple partitions on the device as other blk-mq work is
      also run on the same queue. For example:
      
      - worker 0 claims the mmc host to work on partition 1
      - worker 1 attempts to claim the host for partition 2 but has to wait
        for worker 0 to finish
      - worker 0 schedules complete_work to release the host
      - rescuer thread is triggered after time-out and collects the dangling
        work
      - rescuer thread attempts to complete the work in order starting with
        claim host
      - the task to release host is now blocked by a task to claim it and
        will never be called
      
      The above results in multiple hung tasks that lead to failures to
      mount partitions.
      
      Handling complete_work on a separate workqueue avoids this by keeping
      the work completion tasks separate from the other blk-mq work. This
      allows the host to be released without getting blocked by other tasks
      attempting to claim the host.
      Signed-off-by: 's avatarZachary Hays <zhays@lexmark.com>
      Fixes: 81196976 ("mmc: block: Add blk-mq support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fdf7b8b
    • Jiri Olsa's avatar
      perf/x86: Add check_period PMU callback · 6a66c2d0
      Jiri Olsa authored
      commit 81ec3f3c upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
      The reason is that while event init code does several checks
      for BTS events and prevents several unwanted config bits for
      BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
      to create BTS event without those checks being done.
      
      Following sequence will cause the crash:
      
      If we create an 'almost' BTS event with precise_ip and callchains,
      and it into a BTS event it will crash the perf_prepare_sample()
      function because precise_ip events are expected to come
      in with callchain data initialized, but that's not the
      case for intel_pmu_drain_bts_buffer() caller.
      
      Adding a check_period callback to be called before the period
      is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
      if the event would become BTS. Plus adding also the limit_period
      check as well.
      Reported-by: 's avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: 's avatarJiri Olsa <jolsa@kernel.org>
      Acked-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190204123532.GA4794@kravaSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a66c2d0
  4. 12 Feb, 2019 10 commits
    • Josh Poimboeuf's avatar
      cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · f29a8be0
      Josh Poimboeuf authored
      commit b284909a upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: 's avatarIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: 's avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.comSigned-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      f29a8be0
    • Vladis Dronov's avatar
      HID: debug: fix the ring buffer implementation · a8d5fb2f
      Vladis Dronov authored
      commit 13054abb upstream.
      
      Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
      is strange allowing lost or corrupted data. After commit 717adfda
      ("HID: debug: check length before copy_to_user()") it is possible to enter
      an infinite loop in hid_debug_events_read() by providing 0 as count, this
      locks up a system. Fix this by rewriting the ring buffer implementation
      with kfifo and simplify the code.
      
      This fixes CVE-2019-3819.
      
      v2: fix an execution logic and add a comment
      v3: use __set_current_state() instead of set_current_state()
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
      Cc: stable@vger.kernel.org # v4.18+
      Fixes: cd667ce2 ("HID: use debugfs for events/reports dumping")
      Fixes: 717adfda ("HID: debug: check length before copy_to_user()")
      Signed-off-by: 's avatarVladis Dronov <vdronov@redhat.com>
      Reviewed-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: 's avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8d5fb2f
    • Takashi Iwai's avatar
      ALSA: hda - Serialize codec registrations · 970ad7a2
      Takashi Iwai authored
      commit 305a0ade upstream.
      
      In the current code, the codec registration may happen both at the
      codec bind time and the end of the controller probe time.  In a rare
      occasion, they race with each other, leading to Oops due to the still
      uninitialized card device.
      
      This patch introduces a simple flag to prevent the codec registration
      at the codec bind time as long as the controller probe is going on.
      The controller probe invokes snd_card_register() that does the whole
      registration task, and we don't need to register each piece
      beforehand.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      970ad7a2
    • Charles Keepax's avatar
      ALSA: compress: Fix stop handling on compressed capture streams · 18a47b48
      Charles Keepax authored
      commit 4f2ab5e1 upstream.
      
      It is normal user behaviour to start, stop, then start a stream
      again without closing it. Currently this works for compressed
      playback streams but not capture ones.
      
      The states on a compressed capture stream go directly from OPEN to
      PREPARED, unlike a playback stream which moves to SETUP and waits
      for a write of data before moving to PREPARED. Currently however,
      when a stop is sent the state is set to SETUP for both types of
      streams. This leaves a capture stream in the situation where a new
      start can't be sent as that requires the state to be PREPARED and
      a new set_params can't be sent as that requires the state to be
      OPEN. The only option being to close the stream, and then reopen.
      
      Correct this issues by allowing snd_compr_drain_notify to set the
      state depending on the stream direction, as we already do in
      set_params.
      
      Fixes: 49bb6402 ("ALSA: compress_core: Add support for capture streams")
      Signed-off-by: 's avatarCharles Keepax <ckeepax@opensource.cirrus.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18a47b48
    • Jim Mattson's avatar
      kvm: Change offset in kvm_write_guest_offset_cached to unsigned · 0635be29
      Jim Mattson authored
      [ Upstream commit 7a86dab8 ]
      
      Since the offset is added directly to the hva from the
      gfn_to_hva_cache, a negative offset could result in an out of bounds
      write. The existing BUG_ON only checks for addresses beyond the end of
      the gfn_to_hva_cache, not for addresses before the start of the
      gfn_to_hva_cache.
      
      Note that all current call sites have non-negative offsets.
      
      Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
      Reported-by: 's avatarCfir Cohen <cfir@google.com>
      Signed-off-by: 's avatarJim Mattson <jmattson@google.com>
      Reviewed-by: 's avatarCfir Cohen <cfir@google.com>
      Reviewed-by: 's avatarPeter Shier <pshier@google.com>
      Reviewed-by: 's avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: 's avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: 's avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      0635be29
    • John Fastabend's avatar
      bpf: sk_msg, fix socket data_ready events · 47c12a8a
      John Fastabend authored
      [ Upstream commit 552de910 ]
      
      When a skb verdict program is in-use and either another BPF program
      redirects to that socket or the new SK_PASS support is used the
      data_ready callback does not wake up application. Instead because
      the stream parser/verdict is using the sk data_ready callback we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and if so call the saved pointer which is the
      upper layers wake up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: 's avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      47c12a8a
    • Nathan Chancellor's avatar
      drbd: Avoid Clang warning about pointless switch statment · bee009d6
      Nathan Chancellor authored
      [ Upstream commit a52c5a16 ]
      
      There are several warnings from Clang about no case statement matching
      the constant 0:
      
      In file included from drivers/block/drbd/drbd_receiver.c:48:
      In file included from drivers/block/drbd/drbd_int.h:48:
      In file included from ./include/linux/drbd_genl_api.h:54:
      In file included from ./include/linux/genl_magic_struct.h:236:
      ./include/linux/drbd_genl.h:321:1: warning: no case matching constant
      switch condition '0'
      GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
      'GENL_struct'
              switch (0) {
                      ^
      
      Silence this warning by adding a 'case 0:' statement. Additionally,
      adjust the alignment of the statements in the ct_assert_unique macro to
      avoid a checkpatch warning.
      
      This solution was originally sent by Arnd Bergmann with a default case
      statement: https://lore.kernel.org/patchwork/patch/756723/
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/43Suggested-by: 's avatarLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Nathan Chancellor's avatarNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      bee009d6
    • Parav Pandit's avatar
      RDMA/core: Sync unregistration with netlink commands · f9acb020
      Parav Pandit authored
      [ Upstream commit 01b67117 ]
      
      When the rdma device is getting removed, get resource info can race with
      device removal, as below:
      
            CPU-0                                  CPU-1
          --------                               --------
          rdma_nl_rcv_msg()
             nldev_res_get_cq_dumpit()
                mutex_lock(device_lock);
                get device reference
                mutex_unlock(device_lock);        [..]
                                                  ib_unregister_device()
                                                  /* Valid reference to
                                                   * device->dev exists.
                                                   */
                                                   ib_dealloc_device()
      
                [..]
                provider->fill_res_entry();
      
      Even though device object is not freed, fill_res_entry() can get called on
      device which doesn't have a driver anymore. Kernel core device reference
      count is not sufficient, as this only keeps the structure valid, and
      doesn't guarantee the driver is still loaded.
      
      Similar race can occur with device renaming and device removal, where
      device_rename() tries to rename a unregistered device. While this is fine
      for devices of a class which are not net namespace aware, but it is
      incorrect for net namespace aware class coming in subsequent series.  If a
      class is net namespace aware, then the below [1] call trace is observed in
      above situation.
      
      Therefore, to avoid the race, keep a reference count and let device
      unregistration wait until all netlink users drop the reference.
      
      [1] Call trace:
      kernfs: ns required in 'infiniband' for 'mlx5_0'
      WARNING: CPU: 18 PID: 44270 at fs/kernfs/dir.c:842 kernfs_find_ns+0x104/0x120
      libahci i2c_core mlxfw libata dca [last unloaded: devlink]
      RIP: 0010:kernfs_find_ns+0x104/0x120
      Call Trace:
      kernfs_find_and_get_ns+0x2e/0x50
      sysfs_rename_link_ns+0x40/0xb0
      device_rename+0xb2/0xf0
      ib_device_rename+0xb3/0x100 [ib_core]
      nldev_set_doit+0x165/0x190 [ib_core]
      rdma_nl_rcv_msg+0x249/0x250 [ib_core]
      ? netlink_deliver_tap+0x8f/0x3e0
      rdma_nl_rcv+0xd6/0x120 [ib_core]
      netlink_unicast+0x17c/0x230
      netlink_sendmsg+0x2f0/0x3e0
      sock_sendmsg+0x30/0x40
      __sys_sendto+0xdc/0x160
      
      Fixes: da5c8507 ("RDMA/nldev: add driver-specific resource tracking")
      Signed-off-by: 's avatarParav Pandit <parav@mellanox.com>
      Signed-off-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: 's avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      f9acb020
    • Saeed Mahameed's avatar
      net/mlx5: EQ, Use the right place to store/read IRQ affinity hint · 57da3b1f
      Saeed Mahameed authored
      [ Upstream commit 1e86ace4 ]
      
      Currently the cpu affinity hint mask for completion EQs is stored and
      read from the wrong place, since reading and storing is done from the
      same index, there is no actual issue with that, but internal irq_info
      for completion EQs stars at MLX5_EQ_VEC_COMP_BASE offset in irq_info
      array, this patch changes the code to use the correct offset to store
      and read the IRQ affinity hint.
      Signed-off-by: 's avatarSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Reviewed-by: 's avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: 's avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      57da3b1f
    • Muchun Song's avatar
      gpiolib: Fix possible use after free on label · 4672971c
      Muchun Song authored
      [ Upstream commit 18534df4 ]
      
      gpiod_request_commit() copies the pointer to the label passed as
      an argument only to be used later. But there's a chance the caller
      could immediately free the passed string(e.g., local variable).
      This could trigger a use after free when we use gpio label(e.g.,
      gpiochip_unlock_as_irq(), gpiochip_is_requested()).
      
      To be on the safe side: duplicate the string with kstrdup_const()
      so that if an unaware user passes an address to a stack-allocated
      buffer, we won't get the arbitrary label.
      
      Also fix gpiod_set_consumer_name().
      Signed-off-by: 's avatarMuchun Song <smuchun@gmail.com>
      Signed-off-by: 's avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4672971c
  5. 06 Feb, 2019 4 commits
    • Frank Rowand's avatar
      of: overlay: add tests to validate kfrees from overlay removal · 0aa65adc
      Frank Rowand authored
      commit 144552c7 upstream.
      
      Add checks:
        - attempted kfree due to refcount reaching zero before overlay
          is removed
        - properties linked to an overlay node when the node is removed
        - node refcount > one during node removal in a changeset destroy,
          if the node was created by the changeset
      
      After applying this patch, several validation warnings will be
      reported from the devicetree unittest during boot due to
      pre-existing devicetree bugs. The warnings will be similar to:
      
        OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11
        OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/
        hvac-medium-2
      Tested-by: 's avatarAlan Tull <atull@kernel.org>
      Signed-off-by: 's avatarFrank Rowand <frank.rowand@sony.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0aa65adc
    • Tetsuo Handa's avatar
      oom, oom_reaper: do not enqueue same task twice · 9f1a96b1
      Tetsuo Handa authored
      commit 9bcdeb51 upstream.
      
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg despite the
      number of tasks in that memcg is not 0.  It turned out that there is a
      bug in wake_oom_reaper() which allows enqueuing same task twice which
      makes impossible to decrease the number of tasks in that memcg due to a
      refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @p1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect of this patch, this patch also avoids enqueuing multiple
      threads sharing memory via task_will_free_mem(current) path.
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: arekm's avatarArkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: arekm's avatarArkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f1a96b1
    • Dave Watson's avatar
      net: tls: Save iv in tls_rec for async crypto requests · cb77e08d
      Dave Watson authored
      [ Upstream commit 32eb67b9 ]
      
      aead_request_set_crypt takes an iv pointer, and we change the iv
      soon after setting it.  Some async crypto algorithms don't save the iv,
      so we need to save it in the tls_rec for async requests.
      
      Found by hardcoding x64 aesni to use async crypto manager (to test the async
      codepath), however I don't think this combination can happen in the wild.
      Presumably other hardware offloads will need this fix, but there have been
      no user reports.
      
      Fixes: a42055e8 ("Add support for async encryption of records...")
      Signed-off-by: 's avatarDave Watson <davejwatson@fb.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb77e08d
    • Daniel Borkmann's avatar
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · 61d228f9
      Daniel Borkmann authored
      [ Upstream commit d5256083 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: 's avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: 's avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61d228f9
  6. 31 Jan, 2019 9 commits
    • Deepa Dinamani's avatar
      Input: input_event - fix the CONFIG_SPARC64 mixup · 4c8e5815
      Deepa Dinamani authored
      commit 141e5dca upstream.
      
      Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
      Override with an equivalent conditional.
      
      Fixes: 2e746942 ("Input: input_event - provide override for sparc64")
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Signed-off-by: 's avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: 's avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c8e5815
    • Dexuan Cui's avatar
      Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel() · e4266dff
      Dexuan Cui authored
      [ Upstream commit 4d3c5c69 ]
      
      Commit d86adf48 ("scsi: storvsc: Enable multi-queue support") removed
      the usage of the API in Jan 2017, and the API is not used since then.
      
      netvsc and storvsc have their own algorithms to determine the outgoing
      channel, so this API is useless.
      
      And the API is potentially unsafe, because it reads primary->num_sc without
      any lock held. This can be risky considering the RESCIND-OFFER message.
      
      Let's remove the API.
      
      Cc: Long Li <longli@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      e4266dff
    • Daniel Borkmann's avatar
      bpf: fix sanitation of alu op with pointer / scalar type from different paths · 4bce22c3
      Daniel Borkmann authored
      [ commit d3bd7413 upstream ]
      
      While 979d63d5 ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: 's avatarJann Horn <jannh@google.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      4bce22c3
    • Daniel Borkmann's avatar
      bpf: prevent out of bounds speculation on pointer arithmetic · 078da99d
      Daniel Borkmann authored
      [ commit 979d63d5 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: 's avatarJann Horn <jannh@google.com>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      078da99d
    • Daniel Borkmann's avatar
      bpf: enable access to ax register also from verifier rewrite · 74d3c044
      Daniel Borkmann authored
      [ commit 9b73bfdd upstream ]
      
      Right now we are using BPF ax register in JIT for constant blinding as
      well as in interpreter as temporary variable. Verifier will not be able
      to use it simply because its use will get overridden from the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      will be skipped if there is prior use in either source or destination
      register on the instruction. Taking constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints. Note,
      ax register already has mappings in every eBPF JIT.
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      74d3c044
    • Daniel Borkmann's avatar
      bpf: move tmp variable into ax register in interpreter · 433303ac
      Daniel Borkmann authored
      [ commit 144cd91c upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      433303ac
    • Daniel Borkmann's avatar
      bpf: move {prev_,}insn_idx into verifier env · 629b8af1
      Daniel Borkmann authored
      [ commit c08435ec upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: 's avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      629b8af1
    • Deepa Dinamani's avatar
      Input: input_event - provide override for sparc64 · 16669bbb
      Deepa Dinamani authored
      commit 2e746942 upstream.
      
      The usec part of the timeval is defined as
      __kernel_suseconds_t	tv_usec; /* microseconds */
      
      Arnd noticed that sparc64 is the only architecture that defines
      __kernel_suseconds_t as int rather than long.
      
      This breaks the current y2038 fix for kernel as we only access and define
      the timeval struct for non-kernel use cases.  But, this was hidden by an
      another typo in the use of __KERNEL__ qualifier.
      
      Fix the typo, and provide an override for sparc64.
      
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Reported-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16669bbb
    • Dexuan Cui's avatar
      Drivers: hv: vmbus: Check for ring when getting debug info · 7e66208d
      Dexuan Cui authored
      commit ba50bf1c upstream.
      
      fc96df16 is good and can already fix the "return stack garbage" issue,
      but let's also improve hv_ringbuffer_get_debuginfo(), which would silently
      return stack garbage, if people forget to check channel->state or
      ring_info->ring_buffer, when using the function in the future.
      
      Having an error check in the function would eliminate the potential risk.
      
      Add a Fixes tag to indicate the patch depdendency.
      
      Fixes: fc96df16 ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels")
      Cc: stable@vger.kernel.org
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: 's avatarStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: 's avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e66208d