1. 06 Sep, 2019 3 commits
  2. 29 Aug, 2019 8 commits
  3. 25 Aug, 2019 10 commits
    • Revert "kmemleak: allow to coexist with fault injection" · b1d93b72
      Yang Shi authored
      [ Upstream commit df9576de ]
      
      When running ltp's oom test with kmemleak enabled, the warning below
      was triggered because the kernel detects that __GFP_NOFAIL &
      ~__GFP_DIRECT_RECLAIM is passed in:
      
        WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
        Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
        CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-20190710+ #7
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
        ...
         kmemleak_alloc+0x4e/0xb0
         kmem_cache_alloc+0x2a7/0x3e0
         mempool_alloc_slab+0x2d/0x40
         mempool_alloc+0x118/0x2b0
         bio_alloc_bioset+0x19d/0x350
         get_swap_bio+0x80/0x230
         __swap_writepage+0x5ff/0xb20
      
      mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM; however, kmemleak
      has had __GFP_NOFAIL set all the time since d9570ee3 ("kmemleak:
      allow to coexist with fault injection").  But it doesn't make any
      sense to specify __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM at the same
      time.
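
      A minimal user-space sketch of how the conflicting mask arises (the
      flag names and bit values below are illustrative stand-ins, not the
      kernel's real gfp definitions):

        #include <stdio.h>

        #define FAKE_GFP_DIRECT_RECLAIM 0x1u   /* stand-in bit values only */
        #define FAKE_GFP_NOFAIL         0x2u

        int main(void)
        {
                unsigned int gfp = FAKE_GFP_DIRECT_RECLAIM;  /* typical caller mask */

                gfp &= ~FAKE_GFP_DIRECT_RECLAIM; /* mempool_alloc_slab(): no direct reclaim */
                gfp |= FAKE_GFP_NOFAIL;          /* kmemleak metadata: forced "no fail" */

                if ((gfp & FAKE_GFP_NOFAIL) && !(gfp & FAKE_GFP_DIRECT_RECLAIM))
                        printf("invalid: NOFAIL without DIRECT_RECLAIM\n");
                return 0;
        }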
      
      According to the discussion on the mailing list, the commit should be
      reverted as a short-term solution.  Catalin Marinas will follow up
      with a better long-term solution.
      
      The failure rate of kmemleak metadata allocation may increase in some
      circumstances, but this is an expected side effect.
      
      Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: d9570ee3 ("kmemleak: allow to coexist with fault injection")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b1d93b72
    • mm/hmm: always return EBUSY for invalid ranges in hmm_range_{fault,snapshot} · 424f6f05
      Christoph Hellwig authored
      [ Upstream commit 2bcbeaef ]
      
      We should not have two different error codes for the same
      condition. EAGAIN must be reserved for the FAULT_FLAG_ALLOW_RETRY retry
      case and signals to the caller that the mmap_sem has been unlocked.
      
      Use EBUSY for the !valid case so that callers can get the locking right.
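
      A small user-space illustration of the caller-side distinction the
      two codes are meant to enable (the helpers are hypothetical, not the
      real HMM API):

        #include <errno.h>
        #include <stdbool.h>

        /* -EAGAIN: mmap_sem was dropped on our behalf; re-take it before retrying */
        static bool must_retake_mmap_sem(long ret)
        {
                return ret == -EAGAIN;
        }

        /* -EBUSY: the range is not valid (yet); the locking state is unchanged */
        static bool should_revalidate_and_retry(long ret)
        {
                return ret == -EBUSY;
        }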
      
      Link: https://lore.kernel.org/r/20190724065258.16603-2-hch@lst.de
      Tested-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      [jgg: elaborated commit message]
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      424f6f05
    • mm, vmscan: do not special-case slab reclaim when watermarks are boosted · 0a8ae1db
      Mel Gorman authored
      commit 28360f39 upstream.
      
      Dave Chinner reported a problem pointing a finger at commit 1c30844d
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance between
      slab and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3      5.3.0-rc3
                             vanilla  shrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This shows a slight skew for the max result, representing a large
      outlier; the 1st, 2nd and 3rd quartiles are similar, indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages are
         generally kept at 10%, which matches vm.dirty_background_ratio and
         is the normal, expected historical behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
      
      A more complete set of tests were run that were part of the basis for
      introducing boosting and while there are some differences, they are well
      within tolerances.
      
      Bottom line: special-casing kswapd to avoid slab reclaim makes its
      behaviour unpredictable and can lead to abnormal results for normal
      workloads.
      
      This patch restores the expected behaviour that slab and page cache
      are balanced consistently for a workload with a steady allocation
      ratio of slab/pagecache pages.  It also means that workloads which
      favour the preservation of slab over pagecache can tune this via
      vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
      the parameter when boosting is active.
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a8ae1db
    • mm/usercopy: use memory range to be accessed for wraparound check · 04a6826b
      Isaac J. Manjarres authored
      commit 95153169 upstream.
      
      Currently, when checking to see if accessing n bytes starting at address
      "ptr" will cause a wraparound in the memory addresses, the check in
      check_bogus_address() adds an extra byte, which is incorrect, as the
      range of addresses that will be accessed is [ptr, ptr + (n - 1)].
      
      This can lead to incorrectly detecting a wraparound in the memory
      address, when trying to read 4 KB from memory that is mapped to the
      last possible page in the virtual address space, when in fact
      accessing that range of memory would not cause a wraparound to occur.
      
      Use the memory range that will actually be accessed when considering
      whether accessing a certain number of bytes will cause the memory
      address to wrap around.
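
      A user-space sketch of the corrected check (simplified from what
      check_bogus_address() does; the helper name here is ours): the last
      byte touched is ptr + n - 1, so that is what gets compared.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        /* true if accessing [ptr, ptr + n - 1] would wrap the address space */
        static bool access_wraps(uintptr_t ptr, size_t n)
        {
                if (n == 0)
                        return false;         /* nothing is touched */
                return ptr + n - 1 < ptr;     /* compare the last byte, not ptr + n */
        }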
      
      Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Trilok Soni <tsoni@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      04a6826b
    • mm/memcontrol.c: fix use after free in mem_cgroup_iter() · 5ae015cd
      Miles Chen authored
      commit 54a83d6b upstream.
      
      This patch is sent to report a use-after-free in mem_cgroup_iter()
      after merging commit be2657752e9e ("mm: memcg: fix use after free in
      mem_cgroup_iter()").
      
      I work with the android kernel trees (4.9 & 4.14), and commit
      be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()")
      has been merged into those trees.  However, I can still observe the
      use-after-free issues addressed by that commit (on low-end devices, a
      few times this month).
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
        {
                int node;

                for_each_node(node)
                        free_mem_cgroup_per_node_info(memcg, node);
                free_percpu(memcg->stat);
        +       /* poison memcg before freeing it */
        +       memset(memcg, 0x78, sizeof(struct mem_cgroup));
                kfree(memcg);
        }
      
      The coredump shows the position=0xdbbc2a00 is freed.
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not set up
      sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg in non-hierarchical mode.  When we release a
      memcg, invalidate_reclaim_iterators() reaches only dead_memcg and its
      parents.  If non-hierarchical mode is used,
      invalidate_reclaim_iterators() never reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;
      
              for (; memcg; memcg = parent_mem_cgroup(memcg))
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
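
      A sketch of the shape of such a fix (kernel-style pseudocode; the
      __invalidate_reclaim_iterators() helper is assumed here for
      illustration, not quoted from the actual patch):

        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
                struct mem_cgroup *memcg = dead_memcg;

                /* walk dead_memcg and its parents, as before */
                for (; memcg; memcg = parent_mem_cgroup(memcg))
                        __invalidate_reclaim_iterators(memcg, dead_memcg);

                /*
                 * In non-hierarchical mode the parent walk never reaches
                 * root_mem_cgroup, so clear its cached iterators explicitly.
                 */
                if (!mem_cgroup_is_root(dead_memcg))
                        __invalidate_reclaim_iterators(root_mem_cgroup, dead_memcg);
        }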
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ae015cd
    • mm/z3fold.c: fix z3fold_destroy_pool() race condition · a6b0004e
      Henry Burns authored
      commit b997052b upstream.
      
      The constraint from the zpool use of z3fold_destroy_pool() is that
      there are no outstanding handles to memory (so no active
      allocations), but it is possible for there to be outstanding work on
      either of the two wqs in the pool.
      
      Calling z3fold_deregister_migration() before the workqueues are drained
      means that there can be allocated pages referencing a freed inode,
      causing any thread in compaction to be able to trip over the bad pointer
      in PageMovable().
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
      Fixes: 1f862989 ("mm/z3fold.c: support page migration")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6b0004e
    • mm/z3fold.c: fix z3fold_destroy_pool() ordering · d87e9ae7
      Henry Burns authored
      commit 6051d3bd upstream.
      
      The constraint from the zpool use of z3fold_destroy_pool() is that
      there are no outstanding handles to memory (so no active
      allocations), but it is possible for there to be outstanding work on
      either of the two wqs in the pool.
      
      If there is work queued on pool->compact_wq when it is called,
      z3fold_destroy_pool() will do:
      
         z3fold_destroy_pool()
           destroy_workqueue(pool->release_wq)
           destroy_workqueue(pool->compact_wq)
             drain_workqueue(pool->compact_wq)
               do_compact_page(zhdr)
                 kref_put(&zhdr->refcount)
                   __release_z3fold_page(zhdr, ...)
                     queue_work_on(pool->release_wq, &pool->work) *BOOM*
      
      So compact_wq needs to be destroyed before release_wq.
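
      In other words, a sketch of the required teardown order (simplified
      to just the two calls):

        /*
         * Drain/destroy the compaction queue first: its work items may
         * still queue work onto release_wq, which must outlive it.
         */
        destroy_workqueue(pool->compact_wq);
        destroy_workqueue(pool->release_wq);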
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
      Fixes: 5d03a661 ("mm/z3fold.c: use kref to prevent page free/compact race")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d87e9ae7
    • mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind · 5c0e391b
      Yang Shi authored
      commit a53190a4 upstream.
      
      When running syzkaller internally, we ran into the below bug on 4.9.x
      kernel:
      
        kernel BUG at mm/huge_memory.c:2124!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
        task: ffff880067b34900 task.stack: ffff880068998000
        RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
        Call Trace:
          split_huge_page include/linux/huge_mm.h:100 [inline]
          queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
          walk_pmd_range mm/pagewalk.c:50 [inline]
          walk_pud_range mm/pagewalk.c:90 [inline]
          walk_pgd_range mm/pagewalk.c:116 [inline]
          __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
          walk_page_range+0x154/0x370 mm/pagewalk.c:285
          queue_pages_range+0x115/0x150 mm/mempolicy.c:694
          do_mbind mm/mempolicy.c:1241 [inline]
          SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
          SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
          do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
          entry_SYSCALL_64_after_swapgs+0x5d/0xdb
        Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
        RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
         RSP <ffff88006899f980>
      
      with the below test:
      
        uint64_t r[1] = {0xffffffffffffffff};
      
        int main(void)
        {
              syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
                                      intptr_t res = 0;
              res = syscall(__NR_socket, 0x11, 3, 0x300);
              if (res != -1)
                      r[0] = res;
              *(uint32_t*)0x20000040 = 0x10000;
              *(uint32_t*)0x20000044 = 1;
              *(uint32_t*)0x20000048 = 0xc520;
              *(uint32_t*)0x2000004c = 1;
              syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
              syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
              *(uint64_t*)0x20000340 = 2;
              syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
              return 0;
        }
      
      Actually the test does:
      
        mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
        socket(AF_PACKET, SOCK_RAW, 768)        = 3
        setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
        mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
        mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
      
      The setsockopt() would allocate compound pages (16 pages in this test)
      for packet tx ring, then the mmap() would call packet_mmap() to map the
      pages into the user address space specified by the mmap() call.
      
      When calling mbind(), it would scan the vma to queue the pages for
      migration to the new node.  It would split any huge page since 4.9
      doesn't support THP migration; however, the packet tx ring compound
      pages are not THP and are not even movable.  So, the above bug is
      triggered.
      
      However, later kernels are not hit by this issue due to commit
      d44d363f ("mm: don't assume anonymous pages have SwapBacked flag"),
      which removes the PageSwapBacked check for a different reason.
      
      But, there is a deeper issue.  According to the semantics of mbind(),
      it should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was
      specified together with MPOL_MF_STRICT, but the kernel was unable to
      move all existing pages in the range.  The tx ring of the packet
      socket is definitely not movable; however, mbind() returns success
      for this case.
      
      Although most socket files associate with non-movable pages, XDP may
      have movable pages from gup.  So it is not sufficient to just check
      the underlying file type of the vma in vma_migratable().
      
      Change migrate_page_add() to check whether the page is movable or
      not; if it is unmovable, just return -EIO.  But do not abort the pte
      walk immediately, since there may be pages that are off the LRU
      temporarily.  We should still migrate the other pages if MPOL_MF_MOVE*
      is specified.  Set the has_unmovable flag if some pages could not be
      moved, then return -EIO from mbind() eventually.
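
      A conceptual sketch of that flow (simplified pseudocode; the walker
      and helpers below are hypothetical, not the actual kernel functions):

        bool has_unmovable = false;

        for_each_page_in_range(page) {                   /* hypothetical walker */
                if (!try_isolate_for_migration(page)) {  /* hypothetical helper */
                        has_unmovable = true;            /* remember, but keep walking */
                        continue;
                }
                add_page_to_migration_list(page);        /* hypothetical helper */
        }

        if (has_unmovable && (flags & MPOL_MF_STRICT))
                return -EIO;     /* MPOL_MF_MOVE* was requested but not all pages moved */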
      
      With this change the above test would return -EIO as expected.
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c0e391b
    • mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified · f796f8de
      Yang Shi authored
      commit d8835445 upstream.
      
      When both MPOL_MF_MOVE* and MPOL_MF_STRICT were specified, mbind()
      should try its best to migrate misplaced pages; if some of the pages
      could not be migrated, then return -EIO.
      
      There are three different sub-cases:
       1. vma is not migratable
       2. vma is migratable, but there are unmovable pages
       3. vma is migratable, pages are movable, but migrate_pages() fails
      
      If #1 happens, the kernel would just abort immediately and return
      -EIO, since a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified").
      
      If #3 happens, the kernel would set the policy and migrate the pages
      with best effort, but it won't roll back the migrated pages or reset
      the policy.
      
      Before that commit, they behaved in the same way, and it is better to
      keep their behavior consistent.  But rolling back the migrated pages
      and resetting the policy does not sound feasible, so just make #1
      behave the same as #3.
      
      Userspace will know that not everything was successfully migrated (via
      -EIO), and can take whatever steps it deems necessary - attempt
      rollback, determine which exact page(s) are violating the policy, etc.
      
      Make queue_pages_range() return 1 to indicate that there are
      unmovable pages or that the vma is not migratable.
      
      Case #2 is not handled correctly in the current kernel; the following
      patch will fix it.
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f796f8de
    • mm/hmm: fix bad subpage pointer in try_to_unmap_one · b65f418c
      Ralph Campbell authored
      commit 1de13ee5 upstream.
      
      When migrating an anonymous private page to a ZONE_DEVICE private page,
      the source page->mapping and page->index fields are copied to the
      destination ZONE_DEVICE struct page and the page_mapcount() is
      increased.  This is so rmap_walk() can be used to unmap and migrate the
      page back to system memory.
      
      However, try_to_unmap_one() computes the subpage pointer from a swap
      pte, which yields an invalid page pointer, and a kernel panic results
      such as:
      
        BUG: unable to handle page fault for address: ffffea1fffffffc8
      
      Currently, only single pages can be migrated to device private memory so
      no subpage computation is needed and it can be set to "page".
      
      [rcampbell@nvidia.com: add comment]
        Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b65f418c
  4. 16 Aug, 2019 1 commit
  5. 06 Aug, 2019 10 commits
  6. 31 Jul, 2019 8 commits
    • mm, swap: fix race between swapoff and some swap operations · 12b4d230
      Huang Ying authored
      [ Upstream commit eb085574 ]
      
      When swapin is performed, after getting the swap entry information
      from the page table, the system will swap in the swap entry without
      any lock held to prevent the swap device from being swapped off.
      This may cause a race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may
      not hit many people in practice.  But it is still a race that needs
      to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the
      specified swap entry is valid in its swap device.  If so, it will
      keep the swap entry valid by preventing the swap device from being
      swapped off, until put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path
      run as fast as possible, rcu_read_lock/unlock() and synchronize_rcu()
      are used instead of a reference count to implement
      get/put_swap_device().  From get_swap_device() to put_swap_device(),
      the RCU read side is locked, so synchronize_rcu() in swapoff() will
      wait until put_swap_device() is called.
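
      The resulting usage pattern looks roughly like this (a sketch; error
      handling and the exact call sites are simplified):

        struct swap_info_struct *si;

        si = get_swap_device(entry);     /* rcu_read_lock() + validity check */
        if (!si)
                goto out;                /* stale entry: device was swapped off */

        /* ... safe to touch si->swap_map, cluster_info, swap cache here ... */

        put_swap_device(si);             /* rcu_read_unlock() */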
      
      In addition to the swap_map, cluster_info, etc. data structures in
      struct swap_info_struct, the swap cache radix tree will be freed
      after swapoff, so this patch fixes the race between swap cache lookup
      and swapoff too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapped off when
      its data structure is being accessed.  The hot-path overhead of both
      methods is similar.  The advantages of the RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      12b4d230
    • mm: use down_read_killable for locking mmap_sem in access_remote_vm · 2a8d3447
      Konstantin Khlebnikov authored
      [ Upstream commit 1e426fe2 ]
      
      This function is used by ptrace and proc files like /proc/pid/cmdline and
      /proc/pid/environ.
      
      access_remote_vm() never returns error codes; all errors are ignored
      and only the size of successfully read data is returned.  So, if the
      current task was killed we'll simply return 0 (bytes read).
      
      The mmap_sem could be locked for a long time or forever if something
      goes wrong.  Using a killable lock permits cleanup of stuck tasks and
      simplifies investigation.
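
      The resulting pattern is roughly the following (a sketch, simplified
      from the actual change):

        if (down_read_killable(&mm->mmap_sem))
                return 0;                /* task was killed: report 0 bytes read */

        /* ... find the vma and copy the requested remote data ... */

        up_read(&mm->mmap_sem);
        return ret;                      /* number of bytes actually read */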
      
      Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2a8d3447
    • mm/mmu_notifier: use hlist_add_head_rcu() · ba56ef5f
      Jean-Philippe Brucker authored
      [ Upstream commit 543bdb2d ]
      
      Make mmu_notifier_register() safer by issuing a memory barrier before
      registering a new notifier.  This fixes a theoretical bug on weakly
      ordered CPUs.  For example, take this simplified use of notifiers by a
      driver:
      
      	my_struct->mn.ops = &my_ops; /* (1) */
      	mmu_notifier_register(&my_struct->mn, mm)
      		...
      		hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
      		...
      
      Once mmu_notifier_register() releases the mm locks, another thread can
      invalidate a range:
      
      	mmu_notifier_invalidate_range()
      		...
      		hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
      			if (mn->ops->invalidate_range)
      
      The read side relies on the data dependency between mn and ops to ensure
      that the pointer is properly initialized.  But the write side doesn't have
      any dependency between (1) and (2), so they could be reordered and the
      readers could dereference an invalid mn->ops.  mmu_notifier_register()
      does take all the mm locks before adding to the hlist, but those have
      acquire semantics which isn't sufficient.
      
      By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
      hlist using a store-release, ensuring that readers see prior
      initialization of my_struct.  This situation is better illustrated by
      the litmus test MP+onceassign+derefonce.
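
      For illustration, the same publish pattern in portable user-space C11
      atomics (an analogy, not the kernel code): the release store plays
      the role of hlist_add_head_rcu(), and the dependency-ordered load
      plays the role of hlist_for_each_entry_rcu().

        #include <stdatomic.h>
        #include <stdlib.h>

        struct notifier {
                void (*invalidate)(void);        /* like mn->ops->invalidate_range */
        };

        static _Atomic(struct notifier *) head;  /* like the mm's notifier list */

        static void publish(void (*cb)(void))
        {
                struct notifier *n = malloc(sizeof(*n));

                if (!n)
                        return;
                n->invalidate = cb;                          /* (1) initialize */
                atomic_store_explicit(&head, n,
                                      memory_order_release); /* (2) publish */
        }

        static void invalidate_all(void)
        {
                struct notifier *n = atomic_load_explicit(&head,
                                                          memory_order_consume);

                if (n && n->invalidate)          /* reader sees initialized ops */
                        n->invalidate();
        }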
      
      Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
      Fixes: cddb8a5c ("mmu-notifiers: core")
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ba56ef5f
    • mm/gup.c: remove some BUG_ONs from get_gate_page() · 0ebe6d42
      Andy Lutomirski authored
      [ Upstream commit b5d1c39f ]
      
      If we end up without a PGD or PUD entry backing the gate area, don't BUG
      -- just fail gracefully.
      
      It's not entirely implausible that this could happen some day on x86.  It
      doesn't right now even with an execute-only emulated vsyscall page because
      the fixmap shares the PUD, but the core mm code shouldn't rely on that
      particular detail to avoid OOPSing.
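
      The shape of the change, sketched on one of the checks (the
      surrounding get_gate_page() code is elided):

        /* before */
        BUG_ON(pgd_none(*pgd));

        /* after: fail gracefully instead of oopsing */
        if (pgd_none(*pgd))
                return -EFAULT;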
      
      Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0ebe6d42
    • mm/gup.c: mark undo_dev_pagemap as __maybe_unused · ffd51eba
      Guenter Roeck authored
      [ Upstream commit 790c7369 ]
      
      Several mips builds generate the following build warning.
      
        mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used
      
      The function is declared unconditionally but only called from behind
      various ifdefs. Mark it __maybe_unused.
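
      For illustration, __maybe_unused boils down to the compiler's
      "unused" attribute, which keeps the definition but suppresses the
      warning (a user-space demo; compile with -Wunused-function):

        #define my_maybe_unused __attribute__((__unused__))  /* what the kernel macro expands to */

        /* only called from behind an #ifdef that may be compiled out */
        static int my_maybe_unused helper(int x)
        {
                return x * 2;
        }

        int main(void)
        {
                return 0;        /* no "defined but not used" warning for helper() */
        }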
      
      Link: http://lkml.kernel.org/r/1562072523-22311-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ffd51eba
    • mm/mincore.c: fix race between swapoff and mincore · 8aaa5eef
      Huang Ying authored
      [ Upstream commit aeb309b8 ]
      
      Since commit 4b3ef9da ("mm/swap: split swap cache into 64MB trunks"),
      the address_space associated with the swap device is freed after
      swapoff.  So swap_address_space() users which touch the address_space
      need some kind of mechanism to prevent it from being freed while it
      is being accessed.
      
      When mincore processes an unmapped range for swapped shmem pages, it
      doesn't hold any lock to prevent the swap device from being swapped
      off.  So the following race is possible:
      
      CPU1					CPU2
      do_mincore()				swapoff()
        walk_page_range()
          mincore_unmapped_range()
            __mincore_unmapped_range
              mincore_page
      	  as = swap_address_space()
                ...				  exit_swap_address_space()
                ...				    kvfree(spaces)
      	  find_get_page(as)
      
      The address space may be accessed after being freed.
      
      To fix the race, get_swap_device()/put_swap_device() is used to
      enclose find_get_page() to check whether the swap entry is valid and
      to prevent the swap device from being swapped off while it is being
      accessed.
      
      Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com
      Fixes: 4b3ef9da ("mm/swap: split swap cache into 64MB trunks")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8aaa5eef
    • mm/kmemleak.c: fix check for softirq context · d90e2ab5
      Dmitry Vyukov authored
      [ Upstream commit 6ef90569 ]
      
      in_softirq() is the wrong predicate to check whether we are in a
      softirq context.  It also returns true if we have BH disabled, so
      objects are falsely stamped with the "softirq" comm.  The correct
      predicate is in_serving_softirq().
      
      Previously, if a user did cat on /sys/kernel/debug/kmemleak, they
      would see this, which is clearly wrong since it is system call
      context (see the comm):
      
      unreferenced object 0xffff88805bd661c0 (size 64):
        comm "softirq", pid 0, jiffies 4294942959 (age 12.400s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<0000000007dcb30c>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<0000000007dcb30c>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<0000000007dcb30c>] slab_alloc mm/slab.c:3326 [inline]
          [<0000000007dcb30c>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<00000000969722b7>] kmalloc include/linux/slab.h:547 [inline]
          [<00000000969722b7>] kzalloc include/linux/slab.h:742 [inline]
          [<00000000969722b7>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<00000000969722b7>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<00000000a4134b5f>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000d20248ad>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<000000003d367be7>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<000000003c7c76af>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<000000000c1aeb23>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<000000000157b92b>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000a9f3d058>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000a9f3d058>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000a9f3d058>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<000000001b8da885>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000ba770c62>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      now they will see this:
      
      unreferenced object 0xffff88805413c800 (size 64):
        comm "syz-executor.4", pid 8960, jiffies 4294994003 (age 14.350s)
        hex dump (first 32 bytes):
          00 7a 8a 57 80 88 ff ff e0 00 00 01 00 00 00 00  .z.W............
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000c5d3be64>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<00000000c5d3be64>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000c5d3be64>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000c5d3be64>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<0000000023865be2>] kmalloc include/linux/slab.h:547 [inline]
          [<0000000023865be2>] kzalloc include/linux/slab.h:742 [inline]
          [<0000000023865be2>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<0000000023865be2>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<000000003029a9d4>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000ccd0a87c>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<00000000a85a3785>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<00000000ec13c18d>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<0000000052d748e3>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<00000000512f1014>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000181758bc>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000181758bc>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000181758bc>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<00000000d4b73623>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000c1098bec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Link: http://lkml.kernel.org/r/20190517171507.96046-1-dvyukov@gmail.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d90e2ab5
    • mm/swap: fix release_pages() when releasing devmap pages · 2df1bd94
      Ira Weiny authored
      [ Upstream commit c5d6c45e ]
      
      release_pages() is an optimized version of a loop around put_page().
      Unfortunately for devmap pages the logic is not entirely correct in
      release_pages().  This is because device pages can be of more types
      than just MEMORY_DEVICE_PUBLIC.  There are in fact 4 types: private,
      public, FS DAX, and PCI P2PDMA.  Some of these have specific needs to
      "put" the page while others do not.
      
      This logic to handle any special needs is contained in
      put_devmap_managed_page().  Therefore all devmap pages should be processed
      by this function where we can contain the correct logic for a page put.
      
      Handle all device type pages within release_pages() by calling
      put_devmap_managed_page() on all devmap pages.  If
      put_devmap_managed_page() returns true the page has been put and we
      continue with the next page.  A false return of put_devmap_managed_page()
      means the page did not require special processing and should fall to
      "normal" processing.
      
      This was found via code inspection while determining if release_pages()
      and the new put_user_pages() could be interchangeable.[1]
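
      Schematically, the per-page handling in release_pages() becomes
      something like the following (a sketch, with the surrounding loop and
      LRU handling trimmed):

        if (is_zone_device_page(page)) {
                /* let the devmap code decide how this page type is put */
                if (put_devmap_managed_page(page))
                        continue;        /* handled: private, FS DAX, P2PDMA, ... */
                /* returned false: fall through to the normal put path */
        }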
      
      [1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com
      
      Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2df1bd94