1. 16 Nov, 2017 4 commits
  2. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <[email protected]>
      Reviewed-by: default avatarPhilippe Ombredanne <[email protected]>
      Reviewed-by: default avatarThomas Gleixner <[email protected]>
      Signed-off-by: default avatarGreg Kroah-Hartman <[email protected]>
      b2441318
  3. 12 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [[email protected]: fix arch/sparc/kernel/mdesc.c]
      [[email protected]: semantic fix]
        Link: http://lkml.kernel.org/r/[email protected]
      [[email protected]: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/[email protected]
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarMichal Hocko <[email protected]>
      Acked-by: default avatarVlastimil Babka <[email protected]>
      Cc: Alex Belits <[email protected]>
      Cc: Chris Wilson <[email protected]>
      Cc: Christoph Hellwig <[email protected]>
      Cc: Darrick J. Wong <[email protected]>
      Cc: David Daney <[email protected]>
      Cc: Johannes Weiner <[email protected]>
      Cc: Mel Gorman <[email protected]>
      Cc: NeilBrown <[email protected]>
      Cc: Ralf Baechle <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      dcda9b04
  4. 18 Apr, 2017 1 commit
  5. 23 Feb, 2017 4 commits
    • Tejun Heo's avatar
      slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what's gonna
      happen when there are many to deactivate, isn't expensive at all and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarTejun Heo <[email protected]>
      Reported-by: default avatarJay Vana <[email protected]>
      Acked-by: Vladimir Davydov's avatarVladimir Davydov <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      01fb58bc
    • Tejun Heo's avatar
      slab: implement slab_root_caches list · 510ded33
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarTejun Heo <[email protected]>
      Reported-by: default avatarJay Vana <[email protected]>
      Acked-by: default avatarVladimir Davydov <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      510ded33
    • Tejun Heo's avatar
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarTejun Heo <[email protected]>
      Reported-by: default avatarJay Vana <[email protected]>
      Acked-by: Vladimir Davydov's avatarVladimir Davydov <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      bc2791f8
    • Tejun Heo's avatar
      slab: reorganize memcg_cache_params · 9eeadc8b
      Tejun Heo authored
      We're going to change how memcg caches are iterated.  In preparation,
      clean up and reorganize memcg_cache_params.
      
      * The shared ->list is replaced by ->children in root and
        ->children_node in children.
      
      * ->is_root_cache is removed.  Instead ->root_cache is moved out of
        the child union and now used by both root and children.  NULL
        indicates root cache.  Non-NULL a memcg one.
      
      This patch doesn't cause any observable behavior changes.
      
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarTejun Heo <[email protected]>
      Acked-by: Vladimir Davydov's avatarVladimir Davydov <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      9eeadc8b
  6. 11 Jan, 2017 1 commit
    • Michal Hocko's avatar
      mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER · bb1107f7
      Michal Hocko authored
      Andrey Konovalov has reported the following warning triggered by the
      syzkaller fuzzer.
      
        WARNING: CPU: 1 PID: 9935 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x159c/0x1e20
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 9935 Comm: syz-executor0 Not tainted 4.9.0-rc7+ #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          __alloc_pages_slowpath mm/page_alloc.c:3511
          __alloc_pages_nodemask+0x159c/0x1e20 mm/page_alloc.c:3781
          alloc_pages_current+0x1c7/0x6b0 mm/mempolicy.c:2072
          alloc_pages include/linux/gfp.h:469
          kmalloc_order+0x1f/0x70 mm/slab_common.c:1015
          kmalloc_order_trace+0x1f/0x160 mm/slab_common.c:1026
          kmalloc_large include/linux/slab.h:422
          __kmalloc+0x210/0x2d0 mm/slub.c:3723
          kmalloc include/linux/slab.h:495
          ep_write_iter+0x167/0xb50 drivers/usb/gadget/legacy/inode.c:664
          new_sync_write fs/read_write.c:499
          __vfs_write+0x483/0x760 fs/read_write.c:512
          vfs_write+0x170/0x4e0 fs/read_write.c:560
          SYSC_write fs/read_write.c:607
          SyS_write+0xfb/0x230 fs/read_write.c:599
          entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      The issue is caused by a lack of size check for the request size in
      ep_write_iter which should be fixed.  It, however, points to another
      problem, that SLUB defines KMALLOC_MAX_SIZE too large because the its
      KMALLOC_SHIFT_MAX is (MAX_ORDER + PAGE_SHIFT) which means that the
      resulting page allocator request might be MAX_ORDER which is too large
      (see __alloc_pages_slowpath).
      
      The same applies to the SLOB allocator which allows even larger sizes.
      Make sure that they are capped properly and never request more than
      MAX_ORDER order.
      
      Link: http://lkml.kernel.org/r/[email protected]Signed-off-by: default avatarMichal Hocko <[email protected]>
      Reported-by: default avatarAndrey Konovalov <[email protected]>
      Acked-by: default avatarChristoph Lameter <[email protected]>
      Cc: Alexei Starovoitov <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      bb1107f7
  7. 06 Sep, 2016 1 commit
  8. 26 Jul, 2016 2 commits
  9. 20 May, 2016 1 commit
    • Rasmus Villemoes's avatar
      include/linux: apply __malloc attribute · 48a27055
      Rasmus Villemoes authored
      Attach the malloc attribute to a few allocation functions.  This helps
      gcc generate better code by telling it that the return value doesn't
      alias any existing pointers (which is even more valuable given the
      pessimizations implied by -fno-strict-aliasing).
      
      A simple example of what this allows gcc to do can be seen by looking at
      the last part of drm_atomic_helper_plane_reset:
      
      	plane->state = kzalloc(sizeof(*plane->state), GFP_KERNEL);
      
      	if (plane->state) {
      		plane->state->plane = plane;
      		plane->state->rotation = BIT(DRM_ROTATE_0);
      	}
      
      which compiles to
      
          e8 99 bf d6 ff          callq  ffffffff8116d540 <kmem_cache_alloc_trace>
          48 85 c0                test   %rax,%rax
          48 89 83 40 02 00 00    mov    %rax,0x240(%rbx)
          74 11                   je     ffffffff814015c4 <drm_atomic_helper_plane_reset+0x64>
          48 89 18                mov    %rbx,(%rax)
          48 8b 83 40 02 00 00    mov    0x240(%rbx),%rax [*]
          c7 40 40 01 00 00 00    movl   $0x1,0x40(%rax)
      
      With this patch applied, the instruction at [*] is elided, since the
      store to plane->state->plane is known to not alter the value of
      plane->state.
      
      [[email protected]: coding-style fixes]
      Signed-off-by: default avatarRasmus Villemoes <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Cc: Andi Kleen <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      48a27055
  10. 25 Mar, 2016 2 commits
  11. 15 Mar, 2016 3 commits
    • Laura Abbott's avatar
      slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS · becfda68
      Laura Abbott authored
      SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
      on or off.  Expand its use to be able to turn off all consistency
      checks.  This gives a nice speed up if you only want features such as
      poisoning or tracing.
      
      Credit to Mathias Krause for the original work which inspired this
      series
      Signed-off-by: default avatarLaura Abbott <[email protected]>
      Acked-by: default avatarChristoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <j[email protected]>
      Cc: Kees Cook <[email protected]>
      Cc: Mathias Krause <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      becfda68
    • Jesper Dangaard Brouer's avatar
      mm: fix some spelling · 9f706d68
      Jesper Dangaard Brouer authored
      Fix up trivial spelling errors, noticed while reading the code.
      Signed-off-by: default avatarJesper Dangaard Brouer <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Cc: Vladimir Davydov <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      9f706d68
    • Jesper Dangaard Brouer's avatar
      mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer authored
      This patch introduce a new API call kfree_bulk() for bulk freeing memory
      objects not bound to a single kmem_cache.
      
      Christoph pointed out that it is possible to implement freeing of
      objects, without knowing the kmem_cache pointer as that information is
      available from the object's page->slab_cache.  Proposing to remove the
      kmem_cache argument from the bulk free API.
      
      Jesper demonstrated that these extra steps per object comes at a
      performance cost.  It is only in the case CONFIG_MEMCG_KMEM is compiled
      in and activated runtime that these steps are done anyhow.  The extra
      cost is most visible for SLAB allocator, because the SLUB allocator does
      the page lookup (virt_to_head_page()) anyhow.
      
      Thus, the conclusion was to keep the kmem_cache free bulk API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily.  Simply by handling if kmem_cache_free_bulk() gets called with a
      kmem_cache NULL pointer.
      
      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase code size even more.
      
      Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
      @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
       1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
       2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
       3 - 140 cycles 35.017 ns -  37 cycles 9.326 ns - 33 cycles 8.474 ns
       4 - 137 cycles 34.507 ns -  31 cycles 7.888 ns - 33 cycles 8.300 ns
       8 - 140 cycles 35.069 ns -  25 cycles 6.461 ns - 25 cycles 6.436 ns
       16 - 138 cycles 34.542 ns -  23 cycles 5.945 ns - 22 cycles 5.670 ns
       30 - 136 cycles 34.227 ns -  22 cycles 5.502 ns - 22 cycles 5.587 ns
       32 - 136 cycles 34.253 ns -  21 cycles 5.475 ns - 21 cycles 5.324 ns
       34 - 136 cycles 34.254 ns -  21 cycles 5.448 ns - 20 cycles 5.194 ns
       48 - 136 cycles 34.075 ns -  21 cycles 5.458 ns - 21 cycles 5.367 ns
       64 - 135 cycles 33.994 ns -  21 cycles 5.350 ns - 21 cycles 5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles 5.816 ns - 22 cycles 5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles 5.727 ns - 22 cycles 5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles 6.093 ns - 23 cycles 5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
       1 - 85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
       2 - 81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
       3 - 78 cycles 19.709 ns -  33 cycles 8.354 ns - 32 cycles 8.044 ns
       4 - 77 cycles 19.430 ns -  28 cycles 7.216 ns - 28 cycles 7.003 ns
       8 - 101 cycles 25.288 ns -  23 cycles 5.849 ns - 23 cycles 5.787 ns
       16 - 76 cycles 19.148 ns -  20 cycles 5.162 ns - 20 cycles 5.081 ns
       30 - 76 cycles 19.067 ns -  19 cycles 4.868 ns - 19 cycles 4.821 ns
       32 - 76 cycles 19.052 ns -  19 cycles 4.857 ns - 19 cycles 4.815 ns
       34 - 121 cycles 30.291 ns -  25 cycles 6.333 ns - 25 cycles 6.268 ns
       48 - 108 cycles 27.111 ns -  21 cycles 5.498 ns - 21 cycles 5.458 ns
       64 - 100 cycles 25.164 ns -  20 cycles 5.242 ns - 20 cycles 5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles 6.886 ns - 27 cycles 6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles 7.711 ns - 30 cycles 7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles 9.560 ns - 38 cycles 9.549 ns
      Signed-off-by: default avatarJesper Dangaard Brouer <[email protected]>
      Cc: Christoph Lameter <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Cc: Vladimir Davydov <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      ca257195
  12. 21 Jan, 2016 1 commit
  13. 15 Jan, 2016 1 commit
  14. 22 Nov, 2015 1 commit
  15. 21 Nov, 2015 1 commit
    • Rasmus Villemoes's avatar
      slab.h: sprinkle __assume_aligned attributes · 94a58c36
      Rasmus Villemoes authored
      The various allocators return aligned memory.  Telling the compiler that
      allows it to generate better code in many cases, for example when the
      return value is immediately passed to memset().
      
      Some code does become larger, but at least we win twice as much as we lose:
      
      $ scripts/bloat-o-meter /tmp/vmlinux vmlinux
      add/remove: 0/0 grow/shrink: 13/52 up/down: 995/-2140 (-1145)
      
      An example of the different (and smaller) code can be seen in mm_alloc(). Before:
      
      :       48 8d 78 08             lea    0x8(%rax),%rdi
      :       48 89 c1                mov    %rax,%rcx
      :       48 89 c2                mov    %rax,%rdx
      :       48 c7 00 00 00 00 00    movq   $0x0,(%rax)
      :       48 c7 80 48 03 00 00    movq   $0x0,0x348(%rax)
      :       00 00 00 00
      :       31 c0                   xor    %eax,%eax
      :       48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
      :       48 29 f9                sub    %rdi,%rcx
      :       81 c1 50 03 00 00       add    $0x350,%ecx
      :       c1 e9 03                shr    $0x3,%ecx
      :       f3 48 ab                rep stos %rax,%es:(%rdi)
      
      After:
      
      :       48 89 c2                mov    %rax,%rdx
      :       b9 6a 00 00 00          mov    $0x6a,%ecx
      :       31 c0                   xor    %eax,%eax
      :       48 89 d7                mov    %rdx,%rdi
      :       f3 48 ab                rep stos %rax,%es:(%rdi)
      
      So gcc's strategy is to do two possibly (but not really, of course)
      unaligned stores to the first and last word, then do an aligned rep stos
      covering the middle part with a little overlap.  Maybe arches which do not
      allow unaligned stores gain even more.
      
      I don't know if gcc can actually make use of alignments greater than 8 for
      anything, so one could probably drop the __assume_xyz_alignment macros and
      just use __assume_aligned(8).
      
      The increases in code size are mostly caused by gcc deciding to
      opencode strlen() using the check-four-bytes-at-a-time trick when it
      knows the buffer is sufficiently aligned (one function grew by 200
      bytes). Now it turns out that many of these strlen() calls showing up
      were in fact redundant, and they're gone from -next. Applying the two
      patches to next-20151001 bloat-o-meter instead says
      
      add/remove: 0/0 grow/shrink: 6/52 up/down: 244/-2140 (-1896)
      Signed-off-by: default avatarRasmus Villemoes <[email protected]>
      Acked-by: default avatarChristoph Lameter <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      94a58c36
  16. 06 Nov, 2015 1 commit
  17. 04 Sep, 2015 1 commit
  18. 29 Jun, 2015 1 commit
  19. 25 Jun, 2015 2 commits
  20. 14 Apr, 2015 1 commit
  21. 14 Feb, 2015 1 commit
  22. 13 Feb, 2015 3 commits
  23. 10 Feb, 2015 2 commits
  24. 13 Dec, 2014 1 commit
    • Vladimir Davydov's avatar
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov authored
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: default avatarVladimir Davydov <[email protected]>
      Acked-by: default avatarChristoph Lameter <[email protected]>
      Cc: Johannes Weiner <[email protected]>
      Cc: Michal Hocko <[email protected]>
      Cc: Pekka Enberg <[email protected]>
      Cc: David Rientjes <[email protected]>
      Cc: Joonsoo Kim <[email protected]>
      Signed-off-by: default avatarAndrew Morton <[email protected]>
      Signed-off-by: default avatarLinus Torvalds <[email protected]>
      8135be5a
  25. 11 Dec, 2014 1 commit
  26. 10 Oct, 2014 1 commit