1. 19 Mar, 2019 6 commits
    • Guillaume Nault's avatar
      tcp: handle inet_csk_reqsk_queue_add() failures · 3460bb19
      Guillaume Nault authored
      [  Upstream commit 9d3e1368 ]
      
      Commit 7716682c ("tcp/dccp: fix another race at listener
      dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
      {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
      weren't modified, thus leaking allocated resources on error.
      
      Contrary to tcp_check_req(), in both syncookies and TFO cases,
      we need to drop the request socket. Also, since the child socket is
      created with inet_csk_clone_lock(), we have to unlock it and drop an
      extra reference (->sk_refcount is initially set to 2 and
      inet_csk_reqsk_queue_add() drops only one ref).
      
      For TFO, we also need to revert the work done by tcp_try_fastopen()
      (with reqsk_fastopen_remove()).
      
      Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle")
      Signed-off-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3460bb19
    • Christoph Paasch's avatar
      tcp: Don't access TCP_SKB_CB before initializing it · dfcf44d2
      Christoph Paasch authored
      [ Upstream commit f2feaefd ]
      
      Since commit eeea10b8 ("tcp: add
      tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called
      after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to
      the IP-part of the cb.
      
      We thus should not mock with it, as this can trigger bugs (thanks
      syzkaller):
      [   12.349396] ==================================================================
      [   12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20
      [   12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799
      
      Setting end_seq is actually no more necessary in tcp_filter as it gets
      initialized later on in tcp_vX_fill_cb.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: eeea10b8 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
      Signed-off-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dfcf44d2
    • Soheil Hassas Yeganeh's avatar
      tcp: do not report TCP_CM_INQ of 0 for closed connections · 75c9b039
      Soheil Hassas Yeganeh authored
      [ Upstream commit 6466e715 ]
      
      Returning 0 as inq to userspace indicates there is no more data to
      read, and the application needs to wait for EPOLLIN. For a connection
      that has received FIN from the remote peer, however, the application
      must continue reading until getting EOF (return value of 0
      from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
      being used. Otherwise, the application will never receive a new
      EPOLLIN, since there is no epoll edge after the FIN.
      
      Return 1 when there is no data left on the queue but the
      connection has received FIN, so that the applications continue
      reading.
      
      Fixes: b75eba76 (tcp: send in-queue bytes in cmsg upon read)
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      75c9b039
    • Xin Long's avatar
      route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race · ed98b01c
      Xin Long authored
      [ Upstream commit ee60ad21 ]
      
      The race occurs in __mkroute_output() when 2 threads lookup a dst:
      
        CPU A                 CPU B
        find_exception()
                              find_exception() [fnhe expires]
                              ip_del_fnhe() [fnhe is deleted]
        rt_bind_exception()
      
      In rt_bind_exception() it will bind a deleted fnhe with the new dst, and
      this dst will get no chance to be freed. It causes a dev defcnt leak and
      consecutive dmesg warnings:
      
        unregister_netdevice: waiting for ethX to become free. Usage count = 1
      
      Especially thanks Jon to identify the issue.
      
      This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop
      binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr
      and daddr in rt_bind_exception().
      
      It works as both ip_del_fnhe() and rt_bind_exception() are protected by
      fnhe_lock and the fhne is freed by kfree_rcu().
      
      Fixes: deed49df ("route: check and remove route cache when we get route")
      Signed-off-by: default avatarJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed98b01c
    • Paolo Abeni's avatar
      ipv4/route: fail early when inet dev is missing · b41988c2
      Paolo Abeni authored
      [ Upstream commit 22c74764 ]
      
      If a non local multicast packet reaches ip_route_input_rcu() while
      the ingress device IPv4 private data (in_dev) is NULL, we end up
      doing a NULL pointer dereference in IN_DEV_MFORWARD().
      
      Since the later call to ip_route_input_mc() is going to fail if
      !in_dev, we can fail early in such scenario and avoid the dangerous
      code path.
      
      v1 -> v2:
       - clarified the commit message, no code changes
      Reported-by: default avatarTianhao Zhao <tizhao@redhat.com>
      Fixes: e58e4159 ("net: Enable support for VRF with ipv4 multicast")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b41988c2
    • Eric Dumazet's avatar
      fou, fou6: avoid uninit-value in gue_err() and gue6_err() · a9b0ebbf
      Eric Dumazet authored
      [ Upstream commit 5355ed63 ]
      
      My prior commit missed the fact that these functions
      were using udp_hdr() (aka skb_transport_header())
      to get access to GUE header.
      
      Since pskb_transport_may_pull() does not exist yet, we have to add
      transport_offset to our pskb_may_pull() calls.
      
      BUG: KMSAN: uninit-value in gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
      CPU: 1 PID: 10648 Comm: syz-executor.1 Not tainted 5.0.0+ #11
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x173/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600
       __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
       gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
       __udp4_lib_err_encap_no_sk net/ipv4/udp.c:571 [inline]
       __udp4_lib_err_encap net/ipv4/udp.c:626 [inline]
       __udp4_lib_err+0x12e6/0x1d40 net/ipv4/udp.c:665
       udp_err+0x74/0x90 net/ipv4/udp.c:737
       icmp_socket_deliver net/ipv4/icmp.c:767 [inline]
       icmp_unreach+0xb65/0x1070 net/ipv4/icmp.c:884
       icmp_rcv+0x11a1/0x1950 net/ipv4/icmp.c:1066
       ip_protocol_deliver_rcu+0x584/0xbb0 net/ipv4/ip_input.c:208
       ip_local_deliver_finish net/ipv4/ip_input.c:234 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip_local_deliver+0x624/0x7b0 net/ipv4/ip_input.c:255
       dst_input include/net/dst.h:450 [inline]
       ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip_rcv+0x6bd/0x740 net/ipv4/ip_input.c:524
       __netif_receive_skb_one_core net/core/dev.c:4973 [inline]
       __netif_receive_skb net/core/dev.c:5083 [inline]
       process_backlog+0x756/0x10e0 net/core/dev.c:5923
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x78b/0x1a60 net/core/dev.c:6412
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:375 [inline]
       irq_exit+0x214/0x250 kernel/softirq.c:416
       exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
       smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1064
       apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:814
       </IRQ>
      RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2597
      Code: 48 89 e5 53 48 89 fb e8 63 e7 95 00 8b b8 88 0c 00 00 48 8b 00 48 85 c0 75 12 48 89 df e8 dd db 95 00 c6 00 00 c6 03 00 fb 5b <5d> c3 e8 4e e6 95 00 eb e7 66 90 66 2e 0f 1f 84 00 00 00 00 00 55
      RSP: 0018:ffff888081a0fc80 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff13
      RAX: ffff88821fd6bd80 RBX: ffff888027898000 RCX: ccccccccccccd000
      RDX: ffff88821fca8d80 RSI: ffff888000000000 RDI: 00000000000004a0
      RBP: ffff888081a0fc80 R08: 0000000000000002 R09: ffff888081a0fb08
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffff88811130e388 R14: ffff88811130da00 R15: ffff88812fdb7d80
       finish_task_switch+0xfc/0x2d0 kernel/sched/core.c:2698
       context_switch kernel/sched/core.c:2851 [inline]
       __schedule+0x6cc/0x800 kernel/sched/core.c:3491
       schedule+0x15b/0x240 kernel/sched/core.c:3535
       freezable_schedule include/linux/freezer.h:172 [inline]
       do_nanosleep+0x2ba/0x980 kernel/time/hrtimer.c:1679
       hrtimer_nanosleep kernel/time/hrtimer.c:1733 [inline]
       __do_sys_nanosleep kernel/time/hrtimer.c:1767 [inline]
       __se_sys_nanosleep+0x746/0x960 kernel/time/hrtimer.c:1754
       __x64_sys_nanosleep+0x3e/0x60 kernel/time/hrtimer.c:1754
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x4855a0
      Code: 00 00 48 c7 c0 d4 ff ff ff 64 c7 00 16 00 00 00 31 c0 eb be 66 0f 1f 44 00 00 83 3d b1 11 5d 00 00 75 14 b8 23 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 04 e2 f8 ff c3 48 83 ec 08 e8 3a 55 fd ff
      RSP: 002b:0000000000a4fd58 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
      RAX: ffffffffffffffda RBX: 0000000000085780 RCX: 00000000004855a0
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000a4fd60
      RBP: 00000000000007ec R08: 0000000000000001 R09: 0000000000ceb940
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
      R13: 0000000000a4fdb0 R14: 0000000000085711 R15: 0000000000a4fdc0
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline]
       kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
       kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
       kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
       slab_post_alloc_hook mm/slab.h:445 [inline]
       slab_alloc_node mm/slub.c:2773 [inline]
       __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4398
       __kmalloc_reserve net/core/skbuff.c:140 [inline]
       __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
       alloc_skb include/linux/skbuff.h:1012 [inline]
       alloc_skb_with_frags+0x186/0xa60 net/core/skbuff.c:5287
       sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091
       sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108
       __ip_append_data+0x34cd/0x5000 net/ipv4/ip_output.c:998
       ip_append_data+0x324/0x480 net/ipv4/ip_output.c:1220
       icmp_push_reply+0x23d/0x7e0 net/ipv4/icmp.c:375
       __icmp_send+0x2ea3/0x30f0 net/ipv4/icmp.c:737
       icmp_send include/net/icmp.h:47 [inline]
       ipv4_link_failure+0x6d/0x230 net/ipv4/route.c:1190
       dst_link_failure include/net/dst.h:427 [inline]
       arp_error_report+0x106/0x1a0 net/ipv4/arp.c:297
       neigh_invalidate+0x359/0x8e0 net/core/neighbour.c:992
       neigh_timer_handler+0xdf2/0x1280 net/core/neighbour.c:1078
       call_timer_fn+0x285/0x600 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers+0xdb4/0x11d0 kernel/time/timer.c:1681
       run_timer_softirq+0x2e/0x50 kernel/time/timer.c:1694
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
      
      Fixes: 26fc181e ("fou, fou6: do not assume linear skbs")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a9b0ebbf
  2. 02 Mar, 2019 1 commit
  3. 28 Feb, 2019 2 commits
  4. 26 Feb, 2019 1 commit
    • David Ahern's avatar
      ipv4: Return error for RTA_VIA attribute · b6e9e5df
      David Ahern authored
      IPv4 currently does not support nexthops outside of the AF_INET family.
      Specifically, it does not handle RTA_VIA attribute. If it is passed
      in a route add request, the actual route added only uses the device
      which is clearly not what the user intended:
      
        $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
        $ ip ro ls
        ...
        172.16.1.0/24 dev eth0
      
      Catch this and fail the route add:
        $ ip ro add 172.16.1.0/24 via inet6 2001:db8:1::1 dev eth0
        Error: IPv4 does not support RTA_VIA attribute.
      
      Fixes: 03c05665 ("mpls: Netlink commands to add, remove, and dump routes")
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b6e9e5df
  5. 25 Feb, 2019 2 commits
  6. 24 Feb, 2019 1 commit
    • Eric Dumazet's avatar
      tcp: repaired skbs must init their tso_segs · bf50b606
      Eric Dumazet authored
      syzbot reported a WARN_ON(!tcp_skb_pcount(skb))
      in tcp_send_loss_probe() [1]
      
      This was caused by TCP_REPAIR sent skbs that inadvertenly
      were missing a call to tcp_init_tso_segs()
      
      [1]
      WARNING: CPU: 1 PID: 0 at net/ipv4/tcp_output.c:2534 tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc7+ #77
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       panic+0x2cb/0x65c kernel/panic.c:214
       __warn.cold+0x20/0x45 kernel/panic.c:571
       report_bug+0x263/0x2b0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       fixup_bug arch/x86/kernel/traps.c:173 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
      Code: 88 fc ff ff 4c 89 ef e8 ed 75 c8 fb e9 c8 fc ff ff e8 43 76 c8 fb e9 63 fd ff ff e8 d9 75 c8 fb e9 94 f9 ff ff e8 bf 03 91 fb <0f> 0b e9 7d fa ff ff e8 b3 03 91 fb 0f b6 1d 37 43 7a 03 31 ff 89
      RSP: 0018:ffff8880ae907c60 EFLAGS: 00010206
      RAX: ffff8880a989c340 RBX: 0000000000000000 RCX: ffffffff85dedbdb
      RDX: 0000000000000100 RSI: ffffffff85dee0b1 RDI: 0000000000000005
      RBP: ffff8880ae907c90 R08: ffff8880a989c340 R09: ffffed10147d1ae1
      R10: ffffed10147d1ae0 R11: ffff8880a3e8d703 R12: ffff888091b90040
      R13: ffff8880a3e8d540 R14: 0000000000008000 R15: ffff888091b90860
       tcp_write_timer_handler+0x5c0/0x8a0 net/ipv4/tcp_timer.c:583
       tcp_write_timer+0x10e/0x1d0 net/ipv4/tcp_timer.c:607
       call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers kernel/time/timer.c:1681 [inline]
       __run_timers kernel/time/timer.c:1649 [inline]
       run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
       __do_softirq+0x266/0x95a kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
       </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
      Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 0c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 0c a1 fa eb 82 90 90 90 90 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
      RSP: 0018:ffff8880a98afd78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff1125061 RBX: ffff8880a989c340 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880a989cbbc
      RBP: ffff8880a98afda8 R08: ffff8880a989c340 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffffffff889282f8 R14: 0000000000000001 R15: 0000000000000000
       arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555
       default_idle_call+0x36/0x90 kernel/sched/idle.c:93
       cpuidle_idle_call kernel/sched/idle.c:153 [inline]
       do_idle+0x386/0x570 kernel/sched/idle.c:262
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353
       start_secondary+0x404/0x5c0 arch/x86/kernel/smpboot.c:271
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 79861919 ("tcp: fix TCP_REPAIR xmit queue setup")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf50b606
  7. 23 Feb, 2019 1 commit
  8. 22 Feb, 2019 1 commit
    • Lorenzo Bianconi's avatar
      net: ip_gre: do not report erspan_ver for gre or gretap · 2bdf700e
      Lorenzo Bianconi authored
      Report erspan version field to userspace in ipgre_fill_info just for
      erspan tunnels. The issue can be triggered with the following reproducer:
      
      $ip link add name gre1 type gre local 192.168.0.1 remote 192.168.1.1
      $ip link set dev gre1 up
      $ip -d link sh gre1
      13: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
          link/gre 192.168.0.1 peer 192.168.1.1 promiscuity 0 minmtu 0 maxmtu 0
          gre remote 192.168.1.1 local 192.168.0.1 ttl inherit erspan_ver 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1
      
      Fixes: f551c91d ("net: erspan: introduce erspan v2 for ip_gre")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2bdf700e
  9. 17 Feb, 2019 2 commits
  10. 12 Feb, 2019 1 commit
    • Konstantin Khlebnikov's avatar
      inet_diag: fix reporting cgroup classid and fallback to priority · 1ec17dbd
      Konstantin Khlebnikov authored
      Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
      extensions has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them included into response
      unconditionally or hook into some of lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has not way to request from the beginning.
      
      This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents behavior for other extensions.
      
      Also this patch adds fallback to reporting socket priority. This filed
      is more widely used for traffic classification because ipv4 sockets
      automatically maps TOS to priority and default qdisc pfifo_fast knows
      about that. But priority could be changed via setsockopt SO_PRIORITY so
      INET_DIAG_TOS isn't enough for predicting class.
      
      Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
      reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for most common setup when net_cls isn't set and/or cgroup2 in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1ec17dbd
  11. 11 Feb, 2019 2 commits
  12. 09 Feb, 2019 1 commit
    • Lorenzo Bianconi's avatar
      net: ipv4: use a dedicated counter for icmp_v4 redirect packets · c09551c6
      Lorenzo Bianconi authored
      According to the algorithm described in the comment block at the
      beginning of ip_rt_send_redirect, the host should try to send
      'ip_rt_redirect_number' ICMP redirect packets with an exponential
      backoff and then stop sending them at all assuming that the destination
      ignores redirects.
      If the device has previously sent some ICMP error packets that are
      rate-limited (e.g TTL expired) and continues to receive traffic,
      the redirect packets will never be transmitted. This happens since
      peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
      and so it will never be reset even if the redirect silence timeout
      (ip_rt_redirect_silence) has elapsed without receiving any packet
      requiring redirects.
      
      Fix it by using a dedicated counter for the number of ICMP redirect
      packets that has been sent by the host
      
      I have not been able to identify a given commit that introduced the
      issue since ip_rt_send_redirect implements the same rate-limiting
      algorithm from commit 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c09551c6
  13. 30 Jan, 2019 1 commit
    • Lorenzo Bianconi's avatar
      net: ip_gre: always reports o_key to userspace · feaf5c79
      Lorenzo Bianconi authored
      Erspan protocol (version 1 and 2) relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      erspan_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
          key 1 seq erspan_ver 1
      $ip link set erspan1 up
      $ip -d link sh erspan1
      
      erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
        link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
        erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ipgre_fill_info
      
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      feaf5c79
  14. 28 Jan, 2019 2 commits
    • Martin Willi's avatar
      esp: Skip TX bytes accounting when sending from a request socket · 09db5124
      Martin Willi authored
      On ESP output, sk_wmem_alloc is incremented for the added padding if a
      socket is associated to the skb. When replying with TCP SYNACKs over
      IPsec, the associated sk is a casted request socket, only. Increasing
      sk_wmem_alloc on a request socket results in a write at an arbitrary
      struct offset. In the best case, this produces the following WARNING:
      
      WARNING: CPU: 1 PID: 0 at lib/refcount.c:102 esp_output_head+0x2e4/0x308 [esp4]
      refcount_t: addition on 0; use-after-free.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc3 #2
      Hardware name: Marvell Armada 380/385 (Device Tree)
      [...]
      [<bf0ff354>] (esp_output_head [esp4]) from [<bf1006a4>] (esp_output+0xb8/0x180 [esp4])
      [<bf1006a4>] (esp_output [esp4]) from [<c05dee64>] (xfrm_output_resume+0x558/0x664)
      [<c05dee64>] (xfrm_output_resume) from [<c05d07b0>] (xfrm4_output+0x44/0xc4)
      [<c05d07b0>] (xfrm4_output) from [<c05956bc>] (tcp_v4_send_synack+0xa8/0xe8)
      [<c05956bc>] (tcp_v4_send_synack) from [<c0586ad8>] (tcp_conn_request+0x7f4/0x948)
      [<c0586ad8>] (tcp_conn_request) from [<c058c404>] (tcp_rcv_state_process+0x2a0/0xe64)
      [<c058c404>] (tcp_rcv_state_process) from [<c05958ac>] (tcp_v4_do_rcv+0xf0/0x1f4)
      [<c05958ac>] (tcp_v4_do_rcv) from [<c0598a4c>] (tcp_v4_rcv+0xdb8/0xe20)
      [<c0598a4c>] (tcp_v4_rcv) from [<c056eb74>] (ip_protocol_deliver_rcu+0x2c/0x2dc)
      [<c056eb74>] (ip_protocol_deliver_rcu) from [<c056ee6c>] (ip_local_deliver_finish+0x48/0x54)
      [<c056ee6c>] (ip_local_deliver_finish) from [<c056eecc>] (ip_local_deliver+0x54/0xec)
      [<c056eecc>] (ip_local_deliver) from [<c056efac>] (ip_rcv+0x48/0xb8)
      [<c056efac>] (ip_rcv) from [<c0519c2c>] (__netif_receive_skb_one_core+0x50/0x6c)
      [...]
      
      The issue triggers only when not using TCP syncookies, as for syncookies
      no socket is associated.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: default avatarMartin Willi <martin@strongswan.org>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      09db5124
    • Anders Roxell's avatar
      netfilter: ipt_CLUSTERIP: fix warning unused variable cn · 206b8cc5
      Anders Roxell authored
      When CONFIG_PROC_FS isn't set the variable cn isn't used.
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’:
      net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable]
        struct clusterip_net *cn = clusterip_pernet(net);
                              ^~
      
      Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS".
      
      Fixes: b12f7bad ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine")
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      206b8cc5
  15. 25 Jan, 2019 1 commit
  16. 22 Jan, 2019 1 commit
  17. 18 Jan, 2019 1 commit
  18. 17 Jan, 2019 1 commit
  19. 16 Jan, 2019 4 commits
    • Willem de Bruijn's avatar
      udp: with udp_segment release on error path · 0f149c9f
      Willem de Bruijn authored
      Failure __ip_append_data triggers udp_flush_pending_frames, but these
      tests happen later. The skb must be freed directly.
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0f149c9f
    • Xin Long's avatar
      erspan: build the header with the right proto according to erspan_ver · 20704bd1
      Xin Long authored
      As said in draft-foschiano-erspan-03#section4:
      
         Different frame variants known as "ERSPAN Types" can be
         distinguished based on the GRE "Protocol Type" field value: Type I
         and II's value is 0x88BE while Type III's is 0x22EB [ETYPES].
      
      So set it properly in erspan_xmit() according to erspan_ver. While at
      it, also remove the unused parameter 'proto' in erspan_fb_xmit().
      
      Fixes: 94d7d8f2 ("ip6_gre: add erspan v2 support")
      Reported-by: default avatarJianlin Shi <jishi@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20704bd1
    • Eric Dumazet's avatar
      fou, fou6: do not assume linear skbs · 26fc181e
      Eric Dumazet authored
      Both gue_err() and gue6_err() incorrectly assume
      linear skbs. Fix them to use pskb_may_pull().
      
      BUG: KMSAN: uninit-value in gue6_err+0x475/0xc40 net/ipv6/fou6.c:101
      CPU: 0 PID: 18083 Comm: syz-executor1 Not tainted 5.0.0-rc1+ #7
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x173/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600
       __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
       gue6_err+0x475/0xc40 net/ipv6/fou6.c:101
       __udp6_lib_err_encap_no_sk net/ipv6/udp.c:434 [inline]
       __udp6_lib_err_encap net/ipv6/udp.c:491 [inline]
       __udp6_lib_err+0x18d0/0x2590 net/ipv6/udp.c:522
       udplitev6_err+0x118/0x130 net/ipv6/udplite.c:27
       icmpv6_notify+0x462/0x9f0 net/ipv6/icmp.c:784
       icmpv6_rcv+0x18ac/0x3fa0 net/ipv6/icmp.c:872
       ip6_protocol_deliver_rcu+0xb5a/0x23a0 net/ipv6/ip6_input.c:394
       ip6_input_finish net/ipv6/ip6_input.c:434 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_input+0x2b6/0x350 net/ipv6/ip6_input.c:443
       dst_input include/net/dst.h:450 [inline]
       ip6_rcv_finish+0x4e7/0x6d0 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ipv6_rcv+0x34b/0x3f0 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core net/core/dev.c:4973 [inline]
       __netif_receive_skb net/core/dev.c:5083 [inline]
       process_backlog+0x756/0x10e0 net/core/dev.c:5923
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x78b/0x1a60 net/core/dev.c:6412
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
       do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1039
       </IRQ>
       do_softirq kernel/softirq.c:338 [inline]
       __local_bh_enable_ip+0x16f/0x1a0 kernel/softirq.c:190
       local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
       rcu_read_unlock_bh include/linux/rcupdate.h:696 [inline]
       ip6_finish_output2+0x1d64/0x25f0 net/ipv6/ip6_output.c:121
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:154
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip6_output+0x5ca/0x710 net/ipv6/ip6_output.c:171
       dst_output include/net/dst.h:444 [inline]
       ip6_local_out+0x164/0x1d0 net/ipv6/output_core.c:176
       ip6_send_skb+0xfa/0x390 net/ipv6/ip6_output.c:1727
       udp_v6_send_skb+0x1733/0x1d20 net/ipv6/udp.c:1169
       udpv6_sendmsg+0x424e/0x45d0 net/ipv6/udp.c:1466
       inet_sendmsg+0x54a/0x720 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg net/socket.c:631 [inline]
       ___sys_sendmsg+0xdb9/0x11b0 net/socket.c:2116
       __sys_sendmmsg+0x580/0xad0 net/socket.c:2211
       __do_sys_sendmmsg net/socket.c:2240 [inline]
       __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2237
       __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2237
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x457ec9
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f4a5204fc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000457ec9
      RDX: 00000000040001ab RSI: 0000000020000240 RDI: 0000000000000003
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4a520506d4
      R13: 00000000004c4ce5 R14: 00000000004d85d8 R15: 00000000ffffffff
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline]
       kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
       kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
       kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
       slab_post_alloc_hook mm/slab.h:446 [inline]
       slab_alloc_node mm/slub.c:2754 [inline]
       __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4377
       __kmalloc_reserve net/core/skbuff.c:140 [inline]
       __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
       alloc_skb include/linux/skbuff.h:1012 [inline]
       alloc_skb_with_frags+0x1c7/0xac0 net/core/skbuff.c:5288
       sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091
       sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108
       __ip6_append_data+0x42ed/0x5dc0 net/ipv6/ip6_output.c:1443
       ip6_append_data+0x3c2/0x650 net/ipv6/ip6_output.c:1619
       icmp6_send+0x2f5c/0x3c40 net/ipv6/icmp.c:574
       icmpv6_send+0xe5/0x110 net/ipv6/ip6_icmp.c:43
       ip6_link_failure+0x5c/0x2c0 net/ipv6/route.c:2231
       dst_link_failure include/net/dst.h:427 [inline]
       vti_xmit net/ipv4/ip_vti.c:229 [inline]
       vti_tunnel_xmit+0xf3b/0x1ea0 net/ipv4/ip_vti.c:265
       __netdev_start_xmit include/linux/netdevice.h:4382 [inline]
       netdev_start_xmit include/linux/netdevice.h:4391 [inline]
       xmit_one net/core/dev.c:3278 [inline]
       dev_hard_start_xmit+0x604/0xc40 net/core/dev.c:3294
       __dev_queue_xmit+0x2e48/0x3b80 net/core/dev.c:3864
       dev_queue_xmit+0x4b/0x60 net/core/dev.c:3897
       neigh_direct_output+0x42/0x50 net/core/neighbour.c:1511
       neigh_output include/net/neighbour.h:508 [inline]
       ip6_finish_output2+0x1d4e/0x25f0 net/ipv6/ip6_output.c:120
       ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:154
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip6_output+0x5ca/0x710 net/ipv6/ip6_output.c:171
       dst_output include/net/dst.h:444 [inline]
       ip6_local_out+0x164/0x1d0 net/ipv6/output_core.c:176
       ip6_send_skb+0xfa/0x390 net/ipv6/ip6_output.c:1727
       udp_v6_send_skb+0x1733/0x1d20 net/ipv6/udp.c:1169
       udpv6_sendmsg+0x424e/0x45d0 net/ipv6/udp.c:1466
       inet_sendmsg+0x54a/0x720 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg net/socket.c:631 [inline]
       ___sys_sendmsg+0xdb9/0x11b0 net/socket.c:2116
       __sys_sendmmsg+0x580/0xad0 net/socket.c:2211
       __do_sys_sendmmsg net/socket.c:2240 [inline]
       __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2237
       __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2237
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      
      Fixes: b8a51b38 ("fou, fou6: ICMP error handlers for FoU and GUE")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26fc181e
    • Willem de Bruijn's avatar
      tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state · 13d7f463
      Willem de Bruijn authored
      TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
      the connection and so transitions this socket to CLOSE_WAIT state.
      
      Transmission in close wait state is acceptable. Other similar tests in
      the stack (e.g., in FastOpen) accept both states. Relax this test, too.
      
      Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
      Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Reported-by: default avatarMarek Majkowski <marek@cloudflare.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      CC: Yuchung Cheng <ycheng@google.com>
      CC: Neal Cardwell <ncardwell@google.com>
      CC: Soheil Hassas Yeganeh <soheil@google.com>
      CC: Alexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      13d7f463
  20. 15 Jan, 2019 1 commit
    • Ido Schimmel's avatar
      net: ipv4: Fix memory leak in network namespace dismantle · f97f4dd8
      Ido Schimmel authored
      IPv4 routing tables are flushed in two cases:
      
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      
      Therefore, when a network namespace is being dismantled such routes are
      not flushed and leaked [1].
      
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable 192.0.2.0/24
      # ip netns del blue
      
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
      
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      
      [1]
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
        backtrace:
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
        backtrace:
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f97f4dd8
  21. 12 Jan, 2019 3 commits
    • Taehee Yoo's avatar
      net: bpfilter: disallow to remove bpfilter module while being used · 71a85084
      Taehee Yoo authored
      The bpfilter.ko module can be removed while functions of the bpfilter.ko
      are executing. so panic can occurred. in order to protect that, locks can
      be used. a bpfilter_lock protects routines in the
      __bpfilter_process_sockopt() but it's not enough because __exit routine
      can be executed concurrently.
      
      Now, the bpfilter_umh can not run in parallel.
      So, the module do not removed while it's being used and it do not
      double-create UMH process.
      The members of the umh_info and the bpfilter_umh_ops are protected by
      the bpfilter_umh_ops.lock.
      
      test commands:
         while :
         do
      	iptables -I FORWARD -m string --string ap --algo kmp &
      	modprobe -rv bpfilter &
         done
      
      splat looks like:
      [  298.623435] BUG: unable to handle kernel paging request at fffffbfff807440b
      [  298.628512] #PF error: [normal kernel read fault]
      [  298.633018] PGD 124327067 P4D 124327067 PUD 11c1a3067 PMD 119eb2067 PTE 0
      [  298.638859] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  298.638859] CPU: 0 PID: 2997 Comm: iptables Not tainted 4.20.0+ #154
      [  298.638859] RIP: 0010:__mutex_lock+0x6b9/0x16a0
      [  298.638859] Code: c0 00 00 e8 89 82 ff ff 80 bd 8f fc ff ff 00 0f 85 d9 05 00 00 48 8b 85 80 fc ff ff 48 bf 00 00 00 00 00 fc ff df 48 c1 e8 03 <80> 3c 38 00 0f 85 1d 0e 00 00 48 8b 85 c8 fc ff ff 49 39 47 58 c6
      [  298.638859] RSP: 0018:ffff88810e7777a0 EFLAGS: 00010202
      [  298.638859] RAX: 1ffffffff807440b RBX: ffff888111bd4d80 RCX: 0000000000000000
      [  298.638859] RDX: 1ffff110235ff806 RSI: ffff888111bd5538 RDI: dffffc0000000000
      [  298.638859] RBP: ffff88810e777b30 R08: 0000000080000002 R09: 0000000000000000
      [  298.638859] R10: 0000000000000000 R11: 0000000000000000 R12: fffffbfff168a42c
      [  298.638859] R13: ffff888111bd4d80 R14: ffff8881040e9a05 R15: ffffffffc03a2000
      [  298.638859] FS:  00007f39e3758700(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      [  298.638859] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  298.638859] CR2: fffffbfff807440b CR3: 000000011243e000 CR4: 00000000001006f0
      [  298.638859] Call Trace:
      [  298.638859]  ? mutex_lock_io_nested+0x1560/0x1560
      [  298.638859]  ? kasan_kmalloc+0xa0/0xd0
      [  298.638859]  ? kmem_cache_alloc+0x1c2/0x260
      [  298.638859]  ? __alloc_file+0x92/0x3c0
      [  298.638859]  ? alloc_empty_file+0x43/0x120
      [  298.638859]  ? alloc_file_pseudo+0x220/0x330
      [  298.638859]  ? sock_alloc_file+0x39/0x160
      [  298.638859]  ? __sys_socket+0x113/0x1d0
      [  298.638859]  ? __x64_sys_socket+0x6f/0xb0
      [  298.638859]  ? do_syscall_64+0x138/0x560
      [  298.638859]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  298.638859]  ? __alloc_file+0x92/0x3c0
      [  298.638859]  ? init_object+0x6b/0x80
      [  298.638859]  ? cyc2ns_read_end+0x10/0x10
      [  298.638859]  ? cyc2ns_read_end+0x10/0x10
      [  298.638859]  ? hlock_class+0x140/0x140
      [  298.638859]  ? sched_clock_local+0xd4/0x140
      [  298.638859]  ? sched_clock_local+0xd4/0x140
      [  298.638859]  ? check_flags.part.37+0x440/0x440
      [  298.638859]  ? __lock_acquire+0x4f90/0x4f90
      [  298.638859]  ? set_rq_offline.part.89+0x140/0x140
      [ ... ]
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      71a85084
    • Taehee Yoo's avatar
      net: bpfilter: restart bpfilter_umh when error occurred · 61fbf593
      Taehee Yoo authored
      The bpfilter_umh will be stopped via __stop_umh() when the bpfilter
      error occurred.
      The bpfilter_umh() couldn't start again because there is no restart
      routine.
      
      The section of the bpfilter_umh_{start/end} is no longer .init.rodata
      because these area should be reused in the restart routine. hence
      the section name is changed to .bpfilter_umh.
      
      The bpfilter_ops->start() is restart callback. it will be called when
      bpfilter_umh is stopped.
      The stop bit means bpfilter_umh is stopped. this bit is set by both
      start and stop routine.
      
      Before this patch,
      Test commands:
         $ iptables -vnL
         $ kill -9 <pid of bpfilter_umh>
         $ iptables -vnL
         [  480.045136] bpfilter: write fail -32
         $ iptables -vnL
      
      All iptables commands will fail.
      
      After this patch,
      Test commands:
         $ iptables -vnL
         $ kill -9 <pid of bpfilter_umh>
         $ iptables -vnL
         $ iptables -vnL
      
      Now, all iptables commands will work.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      61fbf593
    • Taehee Yoo's avatar
      net: bpfilter: use cleanup callback to release umh_info · 5b4cb650
      Taehee Yoo authored
      Now, UMH process is killed, do_exit() calls the umh_info->cleanup callback
      to release members of the umh_info.
      This patch makes bpfilter_umh's cleanup routine to use the
      umh_info->cleanup callback.
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5b4cb650
  22. 10 Jan, 2019 2 commits
  23. 09 Jan, 2019 1 commit
    • Su Yanjun's avatar
      vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel · dd9ee344
      Su Yanjun authored
      Recently we run a network test over ipcomp virtual tunnel.We find that
      if a ipv4 packet needs fragment, then the peer can't receive
      it.
      
      We deep into the code and find that when packet need fragment the smaller
      fragment will be encapsulated by ipip not ipcomp. So when the ipip packet
      goes into xfrm, it's skb->dev is not properly set. The ipv4 reassembly code
      always set skb'dev to the last fragment's dev. After ipv4 defrag processing,
      when the kernel rp_filter parameter is set, the skb will be drop by -EXDEV
      error.
      
      This patch adds compatible support for the ipip process in ipcomp virtual tunnel.
      Signed-off-by: default avatarSu Yanjun <suyj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      dd9ee344
  24. 04 Jan, 2019 1 commit