1. 31 Jul, 2018 1 commit
  2. 28 Jun, 2018 1 commit
    • Linus Torvalds's avatar
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds authored
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 26 May, 2018 1 commit
  4. 16 Apr, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet authored
      Applications might use SO_RCVLOWAT on TCP socket hoping to receive
      one [E]POLLIN event only when a given amount of bytes are ready in socket
      receive queue.
      Problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 12 Mar, 2018 1 commit
    • Xin Long's avatar
      sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Xin Long authored
      Now when using 'ss' in iproute, kernel would try to load all _diag
      modules, which also causes corresponding family and proto modules
      to be loaded as well due to module dependencies.
      Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
      would be loaded.
      For example:
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
      - Users usually expect to drop sctp init packet silently when they
        have no sense of sctp protocol instead of sending abort back.
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      This patch is to introduce sock_load_diag_module() where it loads
      the _diag module only when it's corresponding family or proto has
      been already registered.
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
        - move inet proto check to inet_diag to avoid a compiling err.
        - define sock_load_diag_module in sock.c and export one symbol
        - improve the changelog.
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarPhil Sutter <phil@nwl.cc>
      Acked-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 12 Feb, 2018 1 commit
    • Denys Vlasenko's avatar
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko authored
      Changes since v1:
      Added changes in these files:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      Userspace API is not changed.
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: default avatarDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 25 Jan, 2018 1 commit
  8. 27 Nov, 2017 1 commit
  9. 16 Nov, 2017 1 commit
  10. 16 Aug, 2017 1 commit
  11. 01 Aug, 2017 1 commit
  12. 20 Jun, 2017 1 commit
  13. 17 Apr, 2017 1 commit
  14. 06 Apr, 2017 1 commit
  15. 10 Mar, 2017 1 commit
    • David Howells's avatar
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells authored
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      The theory lockdep comes up with is as follows:
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      	mmap_sem must be taken before sk_lock-AF_RXRPC
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      	sk_lock-AF_INET must be taken before mmap_sem
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      Fix the general case by:
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
           because they follow a sock_create_kern() and accept off of that.
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  16. 29 Aug, 2016 1 commit
  17. 01 Jul, 2016 1 commit
    • Jason Wang's avatar
      tun: switch to use skb array for tx · 1576d986
      Jason Wang authored
      We used to queue tx packets in sk_receive_queue, this is less
      efficient since it requires spinlocks to synchronize between producer
      and consumer.
      This patch tries to address this by:
      - switch from sk_receive_queue to a skb_array, and resize it when
        tx_queue_len was changed.
      - introduce a new proto_ops peek_len which was used for peeking the
        skb length.
      - implement a tun version of peek_len for vhost_net to use and convert
        vhost_net to use peek_len if possible.
      Pktgen test shows about 15.3% improvement on guest receiving pps for small
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 16 Jun, 2016 1 commit
    • Jason Donenfeld's avatar
      net: Don't forget pr_fmt on net_dbg_ratelimited for CONFIG_DYNAMIC_DEBUG · daddef76
      Jason Donenfeld authored
      The implementation of net_dbg_ratelimited in the CONFIG_DYNAMIC_DEBUG
      case was added with 2c94b537 ("net: Implement net_dbg_ratelimited() for
      CONFIG_DYNAMIC_DEBUG case"). The implementation strategy was to take the
      usual definition of the dynamic_pr_debug macro, but alter it by adding a
      call to "net_ratelimit()" in the if statement. This is, in fact, the
      correct approach.
      However, while doing this, the author of the commit forgot to surround
      fmt by pr_fmt, resulting in unprefixed log messages appearing in the
      console. So, this commit adds back the pr_fmt(fmt) invocation, making
      net_dbg_ratelimited properly consistent across DEBUG, no DEBUG, and
      DYNAMIC_DEBUG cases, and bringing parity with the behavior of
      dynamic_pr_debug as well.
      Fixes: 2c94b537 ("net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case")
      Signed-off-by: Jason Donenfeld's avatarJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Tim Bingham <tbingham@akamai.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  19. 02 May, 2016 1 commit
    • Tim Bingham's avatar
      net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case · 2c94b537
      Tim Bingham authored
      Prior to commit d92cff89 ("net_dbg_ratelimited: turn into no-op
      when !DEBUG") the implementation of net_dbg_ratelimited() was buggy
      for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases.
      The bug was that net_ratelimit() was being called and, despite
      returning true, nothing was being printed to the console. This
      resulted in messages like the following -
      "net_ratelimit: %d callbacks suppressed"
      with no other output nearby.
      After commit d92cff89 ("net_dbg_ratelimited: turn into no-op when
      !DEBUG") the bug is fixed for the DEBUG case. However, there's no
      output at all for CONFIG_DYNAMIC_DEBUG case.
      This patch restores debug output (if enabled) for the
      Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG
      case. The implementation takes care to check that dynamic debugging is
      enabled before calling net_ratelimit().
      Fixes: d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG")
      Signed-off-by: default avatarTim Bingham <tbingham@akamai.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 28 Mar, 2016 1 commit
  21. 09 Mar, 2016 1 commit
  22. 01 Dec, 2015 2 commits
    • Eric Dumazet's avatar
      net: fix sock_wake_async() rcu protection · ceb5d58b
      Eric Dumazet authored
      Dmitry provided a syzkaller (http://github.com/google/syzkaller)
      triggering a fault in sock_wake_async() when async IO is requested.
      Said program stressed af_unix sockets, but the issue is generic
      and should be addressed in core networking stack.
      The problem is that by the time sock_wake_async() is called,
      we should not access the @flags field of 'struct socket',
      as the inode containing this socket might be freed without
      further notice, and without RCU grace period.
      We already maintain an RCU protected structure, "struct socket_wq"
      is the safe route.
      It also reduces number of cache lines needing dirtying, so might
      provide a performance improvement anyway.
      In followup patches, we might move remaining flags (SOCK_NOSPACE,
      SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
      being mostly read and let it being shared between cpus.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet authored
      This patch is a cleanup to make following patch easier to
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      To ease backports, we rename both constants.
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 08 Oct, 2015 1 commit
  24. 07 Aug, 2015 1 commit
    • Jason Donenfeld's avatar
      net_dbg_ratelimited: turn into no-op when !DEBUG · d92cff89
      Jason Donenfeld authored
      The pr_debug family of functions turns into a no-op when -DDEBUG is not
      specified, opting instead to call "no_printk", which gets compiled to a
      no-op (but retains gcc's nice warnings about printf-style arguments).
      The problem with net_dbg_ratelimited is that it is defined to be a
      variant of net_ratelimited_function, which expands to essentially:
          if (net_ratelimit())
              pr_debug(fmt, ...);
      When DEBUG is not defined, then this becomes,
          if (net_ratelimit())
      This seems benign, except it isn't. Firstly, there's the obvious
      overhead of calling net_ratelimit needlessly, which does quite some book
      keeping for the rate limiting. Given that the pr_debug and
      net_dbg_ratelimited family of functions are sprinkled liberally through
      performance critical code, with developers assuming they'll be compiled
      out to a no-op most of the time, we certainly do not want this needless
      book keeping. Secondly, and most visibly, even though no debug message
      is printed when DEBUG is not defined, if there is a flood of
      invocations, dmesg winds up peppered with messages such as
      "net_ratelimit: 320 callbacks suppressed". This is because our
      aforementioned net_ratelimit() function actually prints this text in
      some circumstances. It's especially odd to see this when there isn't any
      other accompanying debug message.
      So, in sum, it doesn't make sense to have this function's current
      behavior, and instead it should match what every other debug family of
      functions in the kernel does with !DEBUG -- nothing.
      This patch replaces calls to net_dbg_ratelimited when !DEBUG with
      no_printk, keeping with the idiom of all the other debug print helpers.
      Also, though not strictly neccessary, it guards the call with an if (0)
      so that all evaluation of any arguments are sure to be compiled out.
      Signed-off-by: Jason Donenfeld's avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 11 May, 2015 2 commits
  26. 11 Apr, 2015 1 commit
  27. 02 Mar, 2015 1 commit
  28. 14 May, 2014 1 commit
    • Hannes Frederic Sowa's avatar
      net: avoid dependency of net_get_random_once on nop patching · 3d440522
      Hannes Frederic Sowa authored
      net_get_random_once depends on the static keys infrastructure to patch up
      the branch to the slow path during boot. This was realized by abusing the
      static keys api and defining a new initializer to not enable the call
      site while still indicating that the branch point should get patched
      up. This was needed to have the fast path considered likely by gcc.
      The static key initialization during boot up normally walks through all
      the registered keys and either patches in ideal nops or enables the jump
      site but omitted that step on x86 if ideal nops where already placed at
      static_key branch points. Thus net_get_random_once branches not always
      became active.
      This patch switches net_get_random_once to the ordinary static_key
      api and thus places the kernel fast path in the - by gcc considered -
      unlikely path.  Microbenchmarks on Intel and AMD x86-64 showed that
      the unlikely path actually beats the likely path in terms of cycle cost
      and that different nop patterns did not make much difference, thus this
      switch should not be noticeable.
      Fixes: a48e4292 ("net: introduce new macro net_get_random_once")
      Reported-by: default avatarTuomas Räsänen <tuomasjjrasanen@tjjr.fi>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  29. 14 Jan, 2014 1 commit
  30. 11 Dec, 2013 1 commit
  31. 21 Nov, 2013 1 commit
    • Hannes Frederic Sowa's avatar
      net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Hannes Frederic Sowa authored
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
      Optimize for the case recvfrom is called with NULL as address. We don't
      need to copy the address at all, so set it to NULL before invoking the
      recvmsg handler. We can do so, because all the recvmsg handlers must
      cope with the case a plain read() is called on them. read() also sets
      msg_name to NULL.
      Also document these changes in include/linux/net.h as suggested by David
      Changes since RFC:
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      With this change in place I could remove "
      if (!uaddr || msg_sys->msg_namelen == 0)
      	msg->msg_name = NULL
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 25 Oct, 2013 1 commit
  33. 21 Oct, 2013 1 commit
  34. 19 Oct, 2013 1 commit
    • Hannes Frederic Sowa's avatar
      net: introduce new macro net_get_random_once · a48e4292
      Hannes Frederic Sowa authored
      net_get_random_once is a new macro which handles the initialization
      of secret keys. It is possible to call it in the fast path. Only the
      initialization depends on the spinlock and is rather slow. Otherwise
      it should get used just before the key is used to delay the entropy
      extration as late as possible to get better randomness. It returns true
      if the key got initialized.
      The usage of static_keys for net_get_random_once is a bit uncommon so
      it needs some further explanation why this actually works:
      === In the simple non-HAVE_JUMP_LABEL case we actually have ===
      no constrains to use static_key_(true|false) on keys initialized with
      STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of
      the likely case that the initialization is already done. The key is
      initialized like this:
      ___done_key = { .enabled = ATOMIC_INIT(0) }
      The check
                      if (!static_key_true(&___done_key))                     \
      expands into (pseudo code)
                      if (!likely(___done_key > 0))
      , so we take the fast path as soon as ___done_key is increased from the
      helper function.
      === If HAVE_JUMP_LABELs are available this depends ===
      on patching of jumps into the prepared NOPs, which is done in
      jump_label_init at boot-up time (from start_kernel). It is forbidden
      and dangerous to use net_get_random_once in functions which are called
      before that!
      At compilation time NOPs are generated at the call sites of
      net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we
      need to call net_get_random_once two times in inet6_ehashfn, so two NOPs):
            71:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
            76:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      Both will be patched to the actual jumps to the end of the function to
      call __net_get_random_once at boot time as explained above.
      arch_static_branch is optimized and inlined for false as return value and
      actually also returns false in case the NOP is placed in the instruction
      stream. So in the fast case we get a "return false". But because we
      initialize ___done_key with (enabled != (entries & 1)) this call-site
      will get patched up at boot thus returning true. The final check looks
      like this:
                      if (!static_key_true(&___done_key))                     \
                              ___ret = __net_get_random_once(buf,             \
      expands to
                      if (!!static_key_false(&___done_key))                     \
                              ___ret = __net_get_random_once(buf,             \
      So we get true at boot time and as soon as static_key_slow_inc is called
      on the key it will invert the logic and return false for the fast path.
      static_key_slow_inc will change the branch because it got initialized
      with .enabled == 0. After static_key_slow_inc is called on the key the
      branch is replaced with a nop again.
      === Misc: ===
      The helper defers the increment into a workqueue so we don't
      have problems calling this code from atomic sections. A seperate boolean
      (___done) guards the case where we enter net_get_random_once again before
      the increment happend.
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  35. 26 Sep, 2013 1 commit
    • Joe Perches's avatar
      net.h/skbuff.h: Remove extern from function prototypes · 7965bd4d
      Joe Perches authored
      There are a mix of function prototypes with and without extern
      in the kernel sources.  Standardize on not using extern for
      function prototypes.
      Function prototypes don't need to be written with extern.
      extern is assumed by the compiler.  Its use is as unnecessary as
      using auto to declare automatic/local variables in a block.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
  36. 04 Jun, 2013 1 commit
  37. 30 Apr, 2013 1 commit
  38. 13 Oct, 2012 1 commit