1. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  2. 08 Oct, 2016 5 commits
  3. 28 Jul, 2016 2 commits
    • Vlastimil Babka's avatar
      mm, compaction: simplify contended compaction handling · c3486f53
      Vlastimil Babka authored
      Async compaction detects contention either due to failing trylock on
      zone->lock or lru_lock, or by need_resched().  Since 1f9efdef ("mm,
      compaction: khugepaged should not give up due to need_resched()") the
      code got quite complicated to distinguish these two up to the
      __alloc_pages_slowpath() level, so different decisions could be taken
      for khugepaged allocations.
      
      After the recent changes, khugepaged allocations don't check for
      contended compaction anymore, so we again don't need to distinguish lock
      and sched contention, and simplify the current convoluted code a lot.
      
      However, I believe it's also possible to simplify even more and
      completely remove the check for contended compaction after the initial
      async compaction for costly orders, which was originally aimed at THP
      page fault allocations.  There are several reasons why this can be done
      now:
      
      - with the new defaults, THP page faults no longer do reclaim/compaction at
        all, unless the system admin has overridden the default, or application has
        indicated via madvise that it can benefit from THP's. In both cases, it
        means that the potential extra latency is expected and worth the benefits.
      - even if reclaim/compaction proceeds after this patch where it previously
        wouldn't, the second compaction attempt is still async and will detect the
        contention and back off, if the contention persists
      - there are still heuristics like deferred compaction and pageblock skip bits
        in place that prevent excessive THP page fault latencies
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.czSigned-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman's avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3486f53
    • Vlastimil Babka's avatar
      mm, compaction: introduce direct compaction priority · a5508cd8
      Vlastimil Babka authored
      In the context of direct compaction, for some types of allocations we
      would like the compaction to either succeed or definitely fail while
      trying as hard as possible.  Current async/sync_light migration mode is
      insufficient, as there are heuristics such as caching scanner positions,
      marking pageblocks as unsuitable or deferring compaction for a zone.  At
      least the final compaction attempt should be able to override these
      heuristics.
      
      To communicate how hard compaction should try, we replace migration mode
      with a new enum compact_priority and change the relevant function
      signatures.  In compact_zone_order() where struct compact_control is
      constructed, the priority is mapped to suitable control flags.  This
      patch itself has no functional change, as the current priority levels
      are mapped back to the same migration modes as before.  Expanding them
      will be done next.
      
      Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
      removed, as the only caller exists under CONFIG_COMPACTION.
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.czSigned-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman's avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5508cd8
  4. 26 Jul, 2016 2 commits
    • Minchan Kim's avatar
      mm: fix build warnings in <linux/compaction.h> · dd4123f3
      Minchan Kim authored
      Randy reported below build error.
      
      > In file included from ../include/linux/balloon_compaction.h:48:0,
      >                  from ../mm/balloon_compaction.c:11:
      > ../include/linux/compaction.h:237:51: warning: 'struct node' declared inside parameter list [enabled by default]
      >  static inline int compaction_register_node(struct node *node)
      > ../include/linux/compaction.h:237:51: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
      > ../include/linux/compaction.h:242:54: warning: 'struct node' declared inside parameter list [enabled by default]
      >  static inline void compaction_unregister_node(struct node *node)
      >
      
      It was caused by non-lru page migration which needs compaction.h but
      compaction.h doesn't include any header to be standalone.
      
      I think proper header for non-lru page migration is migrate.h rather
      than compaction.h because migrate.h has already headers needed to work
      non-lru page migration indirectly like isolate_mode_t, migrate_mode
      MIGRATEPAGE_SUCCESS.
      
      [akpm@linux-foundation.org: revert mm-balloon-use-general-non-lru-movable-page-feature-fix.patch temp fix]
      Link: http://lkml.kernel.org/r/20160610003304.GE29779@bboxSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reported-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd4123f3
    • Minchan Kim's avatar
      mm: migrate: support non-lru movable page migration · bda807d4
      Minchan Kim authored
      We have allowed migration for only LRU pages until now and it was enough
      to make high-order pages.  But recently, embedded system(e.g., webOS,
      android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
      have seen several reports about troubles of small high-order allocation.
      For fixing the problem, there were several efforts (e,g,.  enhance
      compaction algorithm, SLUB fallback to 0-order page, reserved memory,
      vmalloc and so on) but if there are lots of non-movable pages in system,
      their solutions are void in the long run.
      
      So, this patch is to support facility to change non-movable pages with
      movable.  For the feature, this patch introduces functions related to
      migration to address_space_operations as well as some page flags.
      
      If a driver want to make own pages movable, it should define three
      functions which are function pointers of struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What VM expects on isolate_page function of driver is to return *true*
      if driver isolates page successfully.  On returing true, VM marks the
      page as PG_isolated so concurrent isolation in several CPUs skip the
      page for isolation.  If a driver cannot isolate the page, it should
      return *false*.
      
      Once page is successfully isolated, VM uses page.lru fields so driver
      shouldn't expect to preserve values in that fields.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, VM calls migratepage of driver with isolated page.  The
      function of migratepage is to move content of the old page to new page
      and set up fields of struct page newpage.  Keep in mind that you should
      indicate to the VM the oldpage is no longer movable via
      __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and returns 0.  If driver cannot migrate the page at the
      moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
      migration in a short time because VM interprets -EAGAIN as "temporal
      migration failure".  On returning any error except -EAGAIN, VM will give
      up the page migration without retrying in this time.
      
      Driver shouldn't touch page.lru field VM using in the functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on isolated page, VM should return the isolated page
      to the driver so VM calls driver's putback_page with migration failed
      page.  In this function, driver should put the isolated page back to the
      own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Driver should use the below function to make page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It needs argument of address_space for registering migration family
      functions which will be called by VM.  Exactly speaking, PG_movable is
      not a real flag of struct page.  Rather than, VM reuses page->mapping's
      lower bits to represent it.
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so driver shouldn't access page->mapping directly.  Instead, driver
      should use page_mapping which mask off the low two bits of page->mapping
      so it can get right struct address_space.
      
      For testing of non-lru movable page, VM supports __PageMovable function.
      However, it doesn't guarantee to identify non-lru movable page because
      page->mapping field is unified with other variables in struct page.  As
      well, if driver releases the page after isolation by VM, page->mapping
      doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
      __ClearPageMovable).  But __PageMovable is cheap to catch whether page
      is LRU or non-lru movable once the page has been isolated.  Because LRU
      pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
      good for just peeking to test non-lru movable pages before more
      expensive checking with lock_page in pfn scanning to select victim.
      
      For guaranteeing non-lru movable page, VM provides PageMovable function.
      Unlike __PageMovable, PageMovable functions validates page->mapping and
      mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
      sudden destroying of page->mapping.
      
      Driver using __SetPageMovable should clear the flag via
      __ClearMovablePage under page_lock before the releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, VM marks isolated
      page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
      non-lru movable page, it can skip it.  Driver doesn't need to manipulate
      the flag because VM will set/clear it automatically.  Keep in mind that
      if driver sees PG_isolated page, it means the page have been isolated by
      VM so it shouldn't touch page.lru field.  PG_isolated is alias with
      PG_reclaim flag so driver shouldn't use the flag for own purpose.
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.orgSigned-off-by: default avatarGioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bda807d4
  5. 21 May, 2016 6 commits
    • Michal Hocko's avatar
      mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Michal Hocko authored
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch did miss a mis interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below low watermark + 1<<order gap.  If we are getting really close
      to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
      high order request (e.g.  hugetlb order-9) while the reclaim is not able
      to release enough pages to get us over low watermark.  The reclaim is
      still able to make some progress (usually trashing over few remaining
      pages) so we are not able to break out from the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizont.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just unblock
      __compaction_suitable which doesn't guarantee we will make a further
      progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86a294a8
    • Michal Hocko's avatar
      mm, compaction: abstract compaction feedback to helpers · cab1802b
      Michal Hocko authored
      Compaction can provide a wild variation of feedback to the caller.  Many
      of them are implementation specific and the caller of the compaction
      (especially the page allocator) shouldn't be bound to specifics of the
      current implementation.
      
      This patch abstracts the feedback into three basic types:
      	- compaction_made_progress - compaction was active and made some
      	  progress.
      	- compaction_failed - compaction failed and further attempts to
      	  invoke it would most probably fail and therefore it is not
      	  worth retrying
      	- compaction_withdrawn - compaction wasn't invoked for an
                implementation specific reasons. In the current implementation
                it means that the compaction was deferred, contended or the
                page scanners met too early without any progress. Retrying is
                still worthwhile.
      
      [vbabka@suse.cz: do not change thp back off behavior]
      [akpm@linux-foundation.org: fix typo in comment, per Hillf]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cab1802b
    • Michal Hocko's avatar
      mm, compaction: update compaction_result ordering · 4f9a358c
      Michal Hocko authored
      compaction_result will be used as the primary feedback channel for
      compaction users.  At the same time try_to_compact_pages (and
      potentially others) assume a certain ordering where a more specific
      feedback takes precendence.
      
      This gets a bit awkward when we have conflicting feedback from different
      zones.  E.g one returing COMPACT_COMPLETE meaning the full zone has been
      scanned without any outcome while other returns with COMPACT_PARTIAL aka
      made some progress.  The caller should get COMPACT_PARTIAL because that
      means that the compaction still can make some progress.  The same
      applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED.
      
      Reorder PARTIAL to be the largest one so the larger the value is the
      more progress we have done.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f9a358c
    • Michal Hocko's avatar
      mm, compaction: distinguish between full and partial COMPACT_COMPLETE · c8f7de0b
      Michal Hocko authored
      COMPACT_COMPLETE now means that compaction and free scanner met.  This
      is not very useful information if somebody just wants to use this
      feedback and make any decisions based on that.  The current caller might
      be a poor guy who just happened to scan tiny portion of the zone and
      that could be the reason no suitable pages were compacted.  Make sure we
      distinguish the full and partial zone walks.
      
      Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
      and be optimistic in retrying.
      
      The existing users of COMPACT_COMPLETE are conservatively changed to use
      COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
      reconsidered and only defer the compaction only for COMPACT_COMPLETE
      with the new semantic.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8f7de0b
    • Michal Hocko's avatar
      mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED · 1d4746d3
      Michal Hocko authored
      try_to_compact_pages() can currently return COMPACT_SKIPPED even when
      the compaction is defered for some zone just because zone DMA is skipped
      in 99% of cases due to watermark checks.  This makes COMPACT_DEFERRED
      basically unusable for the page allocator as a feedback mechanism.
      
      Make sure we distinguish those two states properly and switch their
      ordering in the enum.  This would mean that the COMPACT_SKIPPED will be
      returned only when all eligible zones are skipped.
      
      As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
      will be more precise and we would bail out rather than reclaim.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d4746d3
    • Michal Hocko's avatar
      mm, compaction: change COMPACT_ constants into enum · ea7ab982
      Michal Hocko authored
      Compaction code is doing weird dances between COMPACT_FOO -> int ->
      unsigned long
      
      But there doesn't seem to be any reason for that.  All functions which
      return/use one of those constants are not expecting any other value so it
      really makes sense to define an enum for them and make it clear that no
      other values are expected.
      
      This is a pure cleanup and shouldn't introduce any functional changes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea7ab982
  6. 20 May, 2016 1 commit
  7. 17 Mar, 2016 1 commit
    • Vlastimil Babka's avatar
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka authored
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  8. 06 Nov, 2015 2 commits
    • Vlastimil Babka's avatar
      mm, compaction: distinguish contended status in tracepoints · 2d1e1041
      Vlastimil Babka authored
      Compaction returns prematurely with COMPACT_PARTIAL when contended or has
      fatal signal pending.  This is ok for the callers, but might be misleading
      in the traces, as the usual reason to return COMPACT_PARTIAL is that we
      think the allocation should succeed.  After this patch we distinguish the
      premature ending condition in the mm_compaction_finished and
      mm_compaction_end tracepoints.
      
      The contended status covers the following reasons:
      - lock contention or need_resched() detected in async compaction
      - fatal signal pending
      - too many pages isolated in the zone (only for async compaction)
      Further distinguishing the exact reason seems unnecessary for now.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d1e1041
    • Vlastimil Babka's avatar
      mm, compaction: export tracepoints status strings to userspace · fa6c7b46
      Vlastimil Babka authored
      Some compaction tracepoints convert the integer return values to strings
      using the compaction_status_string array.  This works for in-kernel
      printing, but not userspace trace printing of raw captured trace such as
      via trace-cmd report.
      
      This patch converts the private array to appropriate tracepoint macros
      that result in proper userspace support.
      
      trace-cmd output before:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
        zone=ffffffff81815d7a order=9 ret=
      
      after:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
        zone=ffffffff81815d7a order=9 ret=partial
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa6c7b46
  9. 15 Apr, 2015 1 commit
    • Eric B Munson's avatar
      mm: allow compaction of unevictable pages · 5bbe3547
      Eric B Munson authored
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
      Signed-off-by: default avatarEric B Munson <emunson@akamai.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5bbe3547
  10. 12 Feb, 2015 4 commits
  11. 11 Dec, 2014 2 commits
    • Vlastimil Babka's avatar
      mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka authored
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for DMA zone compaction often appeared to succeed
      while subsequent allocation attempt would not, due to different outcome
      of watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The large source of mismatch between watermark check in compaction and
      allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • Vlastimil Babka's avatar
      mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka authored
      Compaction relies on zone watermark checks for decisions such as if it's
      worth to start compacting in compaction_suitable() or whether compaction
      should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, including the direct compaction which is invoked to satisfy
      the allocation request, and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue compaction_suitable() is that the check for "does the zone
      need compaction at all?" comes only after the check "does the zone have
      enough free free pages to succeed compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could.  This can be fixed later, if it's shown to be an issue.
      
      Additionaly the checks in compact_suitable() are reordered to address the
      second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebff3980
  12. 10 Oct, 2014 2 commits
    • Vlastimil Babka's avatar
      mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka authored
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is intended
      to attempt a sync compaction in such case and in these cases it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (both need_resched() or lock contention) immediately
      and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check again need_resched() in isolate_migratepages() and aborting.
       For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is detrmined from the last
      zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
        it was aborted due to lock contention.
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved number of THP collapse allocations by 10%, which shows
      positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of compact_stall
      and compact_fail events have however decreased by 20%, with
      compact_success still a bit improved, which is good.  With benchmark
      configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
      collapse allocations, and only slight improvement in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • Vlastimil Babka's avatar
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka authored
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield allocated page.  There might be corner cases
      but that is inevitable as long as the decision to stop compacting dues not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in stress-highalloc
      mmtests benchmark.  The success here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53853e2d
  13. 04 Jun, 2014 1 commit
  14. 22 Jan, 2014 1 commit
  15. 24 Feb, 2013 1 commit
  16. 11 Jan, 2013 2 commits
    • Mel Gorman's avatar
      mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f
      Mel Gorman authored
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      showing to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca ("mm: compaction: capture a
      suitable high-order page immediately when it is made available").
      Reported-and-tested-by: default avatarEric Wong <normalperson@yhbt.net>
      Tested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fb74b9f
    • Mel Gorman's avatar
      mm: compaction: Partially revert capture of suitable high-order page · 47ecfcb7
      Mel Gorman authored
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      showing to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
      high-order page immediately when it is made available".
      Reported-and-tested-by: default avatarEric Wong <normalperson@yhbt.net>
      Tested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      47ecfcb7
  17. 09 Oct, 2012 2 commits
    • Mel Gorman's avatar
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman authored
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and is basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62997027
    • Mel Gorman's avatar
      mm: compaction: capture a suitable high-order page immediately when it is made available · 1fb3f8ca
      Mel Gorman authored
      While compaction is migrating pages to free up large contiguous blocks
      for allocation it races with other allocation requests that may steal
      these blocks or break them up.  This patch alters direct compaction to
      capture a suitable free page as soon as it becomes available to reduce
      this race.  It uses similar logic to split_free_page() to ensure that
      watermarks are still obeyed.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1fb3f8ca
  18. 21 Aug, 2012 1 commit
    • Mel Gorman's avatar
      mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman authored
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high blocks being written out has
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: default avatarJim Schutt <jaschut@sandia.gov>
      Tested-by: default avatarJim Schutt <jaschut@sandia.gov>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c67fe375
  19. 01 Aug, 2012 1 commit
  20. 04 Jun, 2012 1 commit
  21. 29 May, 2012 1 commit
    • Bartlomiej Zolnierkiewicz's avatar
      mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz authored
      When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
      pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
      allocation takes ownership of the block may take too long.  The type of
      the pageblock remains unchanged so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
        COMPACT_SYNC) and then converting sync field in struct compact_control
        to use it.
      
      * Adding nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
         If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
        suitable page for allocation.  In this case then check how if there were
        enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
        COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
        pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
        sets change the whole pageblock type to MIGRATE_MOVABLE.
      
      My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
      131072 standard 4KiB pages in 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
      
      NOTE: If we can make kswapd aware of order-0 request during compaction, we
      can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ceb9ce6