1. 12 Feb, 2019 1 commit
    • Konstantin Khlebnikov's avatar
      inet_diag: fix reporting cgroup classid and fallback to priority · 1ec17dbd
      Konstantin Khlebnikov authored
      Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
      extensions has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them included into response
      unconditionally or hook into some of lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has not way to request from the beginning.
      
      This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents behavior for other extensions.
      
      Also this patch adds fallback to reporting socket priority. This filed
      is more widely used for traffic classification because ipv4 sockets
      automatically maps TOS to priority and default qdisc pfifo_fast knows
      about that. But priority could be changed via setsockopt SO_PRIORITY so
      INET_DIAG_TOS isn't enough for predicting class.
      
      Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
      reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for most common setup when net_cls isn't set and/or cgroup2 in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1ec17dbd
  2. 21 Dec, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: fix a race in inet_diag_dump_icsk() · f0c928d8
      Eric Dumazet authored
      Alexei reported use after frees in inet_diag_dump_icsk() [1]
      
      Because we use refcount_set() when various sockets are setup and
      inserted into ehash, we also need to make sure inet_diag_dump_icsk()
      wont race with the refcount_set() operations.
      
      Jonathan Lemon sent a patch changing net_twsk_hashdance() but
      other spots would need risky changes.
      
      Instead, fix inet_diag_dump_icsk() as this bug came with
      linux-4.10 only.
      
      [1] Quoting Alexei :
      
      First something iterating over sockets finds already freed tw socket:
      
      refcount_t: increment on 0; use-after-free.
      WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
      RIP: 0010:refcount_inc+0x26/0x30
      RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
      RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
      RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
      FS:  00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag]  // sock_hold(sk); in net/ipv4/inet_diag.c:1002
       ? kmalloc_large_node+0x37/0x70
       ? __kmalloc_node_track_caller+0x1cb/0x260
       ? __alloc_skb+0x72/0x1b0
       ? __kmalloc_reserve.isra.40+0x2e/0x80
       __inet_diag_dump+0x3b/0x80 [inet_diag]
       netlink_dump+0x116/0x2a0
       netlink_recvmsg+0x205/0x3c0
       sock_read_iter+0x89/0xd0
       __vfs_read+0xf7/0x140
       vfs_read+0x8a/0x140
       SyS_read+0x3f/0xa0
       do_syscall_64+0x5a/0x100
      
      then a minute later twsk timer fires and hits two bad refcnts
      for this freed socket:
      
      refcount_t: decrement hit 0; leaking memory.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
      Modules linked in:
      RIP: 0010:refcount_dec+0x2e/0x40
      RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
      RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_kill+0x9d/0xc0  // inet_twsk_bind_unhash(tw, hashinfo);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
      RIP: 0010:refcount_sub_and_test+0x46/0x50
      RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
      RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_put+0x12/0x20  // inet_twsk_put(tw);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      Fixes: 67db3e4b ("tcp: no longer hold ehash lock while calling tcp_get_info()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f0c928d8
  3. 12 Mar, 2018 1 commit
    • Xin Long's avatar
      sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Xin Long authored
      Now when using 'ss' in iproute, kernel would try to load all _diag
      modules, which also causes corresponding family and proto modules
      to be loaded as well due to module dependencies.
      
      Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
      would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users usually expect to drop sctp init packet silently when they
        have no sense of sctp protocol instead of sending abort back.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch is to introduce sock_load_diag_module() where it loads
      the _diag module only when it's corresponding family or proto has
      been already registered.
      
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compiling err.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarPhil Sutter <phil@nwl.cc>
      Acked-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf2ae2e4
  4. 02 Jan, 2018 1 commit
    • Kristian Evensen's avatar
      inet_diag: Add equal-operator for ports · bbb6189d
      Kristian Evensen authored
      inet_diag currently provides less/greater than or equal operators for
      comparing ports when filtering sockets. An equal comparison can be
      performed by combining the two existing operators, or a user can for
      example request a port range and then do the final filtering in
      userspace. However, these approaches both have drawbacks. Implementing
      equal using LE/GE causes the size and complexity of a filter to grow
      quickly as the number of ports increase, while it on busy machines would
      be great if the kernel only returns information about relevant sockets.
      
      This patch introduces source and destination port equal operators.
      INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a
      destination port, and usage is the same as for the existing port
      operators.  I.e., the port to match is stored in the no-member of the
      next inet_diag_bc_op-struct in the filter.
      Signed-off-by: default avatarKristian Evensen <kristian.evensen@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbb6189d
  5. 02 Sep, 2017 1 commit
  6. 18 Aug, 2017 1 commit
  7. 14 Jan, 2017 2 commits
    • Yuchung Cheng's avatar
      tcp: remove early retransmit · bec41a11
      Yuchung Cheng authored
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of (min)RTT, hence it covers the early
      retransmit except RACK does not limit itself to specific inflight
      or dupack numbers.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bec41a11
    • Yuchung Cheng's avatar
      tcp: add reordering timer in RACK loss detection · 57dde7f7
      Yuchung Cheng authored
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accomodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually hang the timer. The next jiffy is to lower-bound the timeout
      to 2 jiffies when reo_wnd is < 1ms.
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      57dde7f7
  8. 09 Nov, 2016 1 commit
  9. 23 Oct, 2016 1 commit
    • Cyrill Gorcunov's avatar
      net: ip, diag -- Add diag interface for raw sockets · 432490f9
      Cyrill Gorcunov authored
      In criu we are actively using diag interface to collect sockets
      present in the system when dumping applications. And while for
      unix, tcp, udp[lite], packet, netlink it works as expected,
      the raw sockets do not have. Thus add it.
      
      v2:
       - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
       - implement @Destroy for diag requests (by dsa@)
      
      v3:
       - add export of raw_abort for IPv6 (by dsa@)
       - pass net-admin flag into inet_sk_diag_fill due to
         changes in net-next branch (by dsa@)
      
      v4:
       - use @pad in struct inet_diag_req_v2 for raw socket
         protocol specification: raw module carries sockets
         which may have custom protocol passed from socket()
         syscall and sole @sdiag_protocol is not enough to
         match underlied ones
       - start reporting protocol specifed in socket() call
         when sockets are raw ones for the same reason: user
         space tools like ss may parse this attribute and use
         it for socket matching
      
      v5 (by eric.dumazet@):
       - use sock_hold in raw_sock_get instead of atomic_inc,
         we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
         when looking up so counter won't be zero here.
      
      v6:
       - use sdiag_raw_protocol() helper which will access @pad
         structure used for raw sockets protocol specification:
         we can't simply rename this member without breaking uapi
      
      v7:
       - sine sdiag_raw_protocol() helper is not suitable for
         uapi lets rather make an alias structure with proper
         names. __check_inet_diag_req_raw helper will catch
         if any of structure unintentionally changed.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      432490f9
  10. 20 Oct, 2016 1 commit
    • Eric Dumazet's avatar
      tcp: relax listening_hash operations · 9652dc2e
      Eric Dumazet authored
      softirq handlers use RCU protection to lookup listeners,
      and write operations all happen from process context.
      We do not need to block BH for dump operations.
      
      Also SYN_RECV since request sockets are stored in the ehash table :
      
       1) inet_diag_dump_icsk() no longer need to clear
          cb->args[3] and cb->args[4] that were used as cursors while
          iterating the old per listener hash table.
      
       2) Also factorize a test : No need to scan listening_hash[]
          if r->id.idiag_dport is not zero.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9652dc2e
  11. 08 Sep, 2016 1 commit
  12. 25 Aug, 2016 2 commits
  13. 28 Jun, 2016 1 commit
  14. 26 Apr, 2016 1 commit
  15. 21 Apr, 2016 1 commit
  16. 15 Apr, 2016 1 commit
    • Xin Long's avatar
      sctp: export some functions for sctp_diag in inet_diag · cb2050a7
      Xin Long authored
      inet_diag_msg_common_fill is used to fill the diag msg common info,
      we need to use it in sctp_diag as well, so export it.
      
      inet_diag_msg_attrs_fill is used to fill some common attrs info between
      sctp diag and tcp diag.
      
      v2->v3:
      - do not need to define and export inet_diag_get_handler any more.
        cause all the functions in it are in sctp_diag.ko, we just call
        them in sctp_diag.ko.
      
      - add inet_diag_msg_attrs_fill to make codes clear.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb2050a7
  17. 05 Apr, 2016 2 commits
    • Eric Dumazet's avatar
      tcp/dccp: do not touch listener sk_refcnt under synflood · 3b24d854
      Eric Dumazet authored
      When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
      cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
      
      By letting listeners use SOCK_RCU_FREE infrastructure,
      we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
      
      Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
      only listeners are impacted by this change.
      
      Peak performance under SYNFLOOD is increased by ~33% :
      
      On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
      
      Most consuming functions are now skb_set_owner_w() and sock_wfree()
      contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b24d854
    • Eric Dumazet's avatar
      tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915
      Eric Dumazet authored
      RX packet processing holds rcu_read_lock(), so we can remove
      pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
      if inet_diag also holds rcu before calling them.
      
      This is needed anyway as __inet_lookup_listener() and
      inet6_lookup_listener() will soon no longer increment
      refcount on the found listener.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d331915
  18. 14 Mar, 2016 1 commit
  19. 11 Feb, 2016 1 commit
  20. 21 Jan, 2016 1 commit
  21. 16 Dec, 2015 2 commits
  22. 03 Oct, 2015 2 commits
    • Eric Dumazet's avatar
      tcp/dccp: install syn_recv requests into ehash table · 079096f1
      Eric Dumazet authored
      In this patch, we insert request sockets into TCP/DCCP
      regular ehash table (where ESTABLISHED and TIMEWAIT sockets
      are) instead of using the per listener hash table.
      
      ACK packets find SYN_RECV pseudo sockets without having
      to find and lock the listener.
      
      In nominal conditions, this halves pressure on listener lock.
      
      Note that this will allow for SO_REUSEPORT refinements,
      so that we can select a listener using cpu/numa affinities instead
      of the prior 'consistent hash', since only SYN packets will
      apply this selection logic.
      
      We will shrink listen_sock in the following patch to ease
      code review.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ying Cai <ycai@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      079096f1
    • Eric Dumazet's avatar
      tcp: move qlen/young out of struct listen_sock · aac065c5
      Eric Dumazet authored
      qlen_inc & young_inc were protected by listener lock,
      while qlen_dec & young_dec were atomic fields.
      
      Everything needs to be atomic for upcoming lockless listener.
      
      Also move qlen/young in request_sock_queue as we'll get rid
      of struct listen_sock eventually.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aac065c5
  23. 11 Jul, 2015 1 commit
    • Phil Sutter's avatar
      net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets · 8220ea23
      Phil Sutter authored
      Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
      sockopt", I am not happy with the limitations it causes for socket
      analysing code in userspace. Exporting the value only if it is set makes
      it hard for userspace to decide whether the option is not set or the
      kernel does not support exporting the option at all.
      
      >From an auditor's perspective, the interesting question for listening
      AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
      the unexpected case. This patch allows to answer this question reliably.
      Signed-off-by: default avatarPhil Sutter <phil@nwl.cc>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8220ea23
  24. 24 Jun, 2015 1 commit
    • Phil Sutter's avatar
      net: inet_diag: export IPV6_V6ONLY sockopt · 20462155
      Phil Sutter authored
      For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
      exported to userspace. It indicates whether a socket bound to in6addr_any
      listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
      listed by e.g. 'ss -l -4'.
      
      This patch is accompanied by an appropriate one for iproute2 to enable
      the additional information in 'ss -e'.
      Signed-off-by: default avatarPhil Sutter <phil@nwl.cc>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20462155
  25. 23 Jun, 2015 1 commit
  26. 21 Jun, 2015 1 commit
  27. 16 Jun, 2015 2 commits
  28. 29 Apr, 2015 1 commit
  29. 17 Apr, 2015 1 commit
    • Eric Dumazet's avatar
      inet_diag: fix access to tcp cc information · 521f1cf1
      Eric Dumazet authored
      Two different problems are fixed here :
      
      1) inet_sk_diag_fill() might be called without socket lock held.
         icsk->icsk_ca_ops can change under us and module be unloaded.
         -> Access to freed memory.
         Fix this using rcu_read_lock() to prevent module unload.
      
      2) Some TCP Congestion Control modules provide information
         but again this is not safe against icsk->icsk_ca_ops
         change and nla_put() errors were ignored. Some sockets
         could not get the additional info if skb was almost full.
      
      Fix this by returning a status from get_info() handlers and
      using rcu protection as well.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      521f1cf1
  30. 13 Apr, 2015 1 commit
    • Eric Dumazet's avatar
      tcp/dccp: get rid of central timewait timer · 789f558c
      Eric Dumazet authored
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, code is ugly and source of huge latencies
      (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
      
      We can afford to use an extra 64 bytes per timewait sock and spread
      timewait load to all cpus to have better behavior.
      
      Tested:
      
      On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24)
      
      Before patch :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch :
      
      About 90% increase of throughput :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      if network utilization is 90% higher :
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      789f558c
  31. 23 Mar, 2015 1 commit
  32. 20 Mar, 2015 1 commit
    • Eric Dumazet's avatar
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet authored
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fa76ce73
  33. 19 Mar, 2015 1 commit
  34. 16 Mar, 2015 1 commit