    mm: mmu_gather: remove __tlb_reset_range() for force flush · 7a30df49
    Yang Shi authored
    A few new fields were added to mmu_gather to make TLB flushing smarter
    for huge pages by recording which level of the page table has changed.
    
    __tlb_reset_range() resets all of this page table state to unchanged.
    It is called by the TLB flush path when mapping changes for the same
    range run in parallel under a non-exclusive lock (i.e.  read mmap_sem).
    
    Before commit dd2283f2 ("mm: mmap: zap pages with read mmap_sem in
    munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
    PTEs in parallel didn't remove page tables.  But the aforementioned
    commit may do munmap() under read mmap_sem and free page tables.  This
    may result in a program hang on aarch64, as reported by Jan Stancek.
    The problem can be reproduced with a slightly modified version of his
    test program, shown below.
    
    ---8<---
    
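    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    
    /* Assumed value: this message does not say how large the area is. */
    #define DISTANT_MMAP_SIZE (2UL * 1024 * 1024 * 1024)
    
    static int map_size = 4096;
    static int num_iter = 500;
    static long threads_total;
    
    static void *distant_area;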
    
    void *map_write_unmap(void *ptr)
    {
    	int *fd = ptr;
    	unsigned char *map_address;
    	int i, j = 0;
    
    	for (i = 0; i < num_iter; i++) {
    		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
    			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    		if (map_address == MAP_FAILED) {
    			perror("mmap");
    			exit(1);
    		}
    
    		for (j = 0; j < map_size; j++)
    			map_address[j] = 'b';
    
    		if (munmap(map_address, map_size) == -1) {
    			perror("munmap");
    			exit(1);
    		}
    	}
    
    	return NULL;
    }
    
    void *dummy(void *ptr)
    {
    	return NULL;
    }
    
    int main(void)
    {
    	pthread_t thid[2];
    
    	/* hint for mmap in map_write_unmap() */
    	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
    			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
    	distant_area += DISTANT_MMAP_SIZE / 2;
    
    	while (1) {
    		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
    		pthread_create(&thid[1], NULL, dummy, NULL);
    
    		pthread_join(thid[0], NULL);
    		pthread_join(thid[1], NULL);
    	}
    }
    ---8<---
    
    The program may lead to a parallel execution like the following:
    
            t1                                        t2
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
      tlb_gather_mmu()
        inc_tlb_flush_pending(tlb->mm);
      free_pgtables()
        tlb->freed_tables = 1
        tlb->cleared_pmds = 1
    
                                            pthread_exit()
                                            madvise(thread_stack, 8M, MADV_DONTNEED)
                                              zap_page_range()
                                                tlb_gather_mmu()
                                                  inc_tlb_flush_pending(tlb->mm);
    
      tlb_finish_mmu()
        if (mm_tlb_flush_nested(tlb->mm))
          __tlb_reset_range()
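
    The nesting check in the trace above can be pictured with a small
    user-space model.  This is only a sketch: struct mm_model and the
    model_* helpers are hypothetical names that mimic the counting done by
    inc_tlb_flush_pending() and mm_tlb_flush_nested(); it is not the kernel
    implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm counter of in-flight gathers. */
struct mm_model {
	atomic_int tlb_flush_pending;
};

/* Mimics inc_tlb_flush_pending(): a gather has started on this mm. */
static void model_inc_pending(struct mm_model *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/* Mimics the matching decrement when a gather finishes. */
static void model_dec_pending(struct mm_model *mm)
{
	atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

/* Mimics mm_tlb_flush_nested(): true if another gather overlaps ours. */
static bool model_flush_nested(struct mm_model *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 1;
}
```

    In the trace, t1 and t2 each increment the counter before t1 reaches
    tlb_finish_mmu(), so the nested check is true for t1 even though only
    t1 has freed page tables.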
    
    __tlb_reset_range() would reset the freed_tables and cleared_* bits, but
    this causes an inconsistency for munmap(), which does free page tables.
    As a result, some architectures, e.g.  aarch64, may not flush the TLB
    completely as expected, leaving stale TLB entries behind.
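
    The lost state can be illustrated with a simplified user-space model.
    This is a sketch under stated assumptions: struct gather_model and the
    gm_* helpers are hypothetical, only a few of the mmu_gather bits are
    modelled, and the "fullmm" variant simply skips the reset and forces a
    full flush, which may differ in detail from the actual kernel change.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the mmu_gather state discussed here. */
struct gather_model {
	bool fullmm;		/* flush the whole address space */
	bool freed_tables;	/* page tables were freed */
	bool cleared_ptes;	/* PTE-level entries were cleared */
	bool cleared_pmds;	/* PMD-level entries were cleared */
	int flush_pending;	/* number of gathers in flight on this mm */
};

/* Mimics __tlb_reset_range(): forget which levels were touched. */
static void gm_reset_range(struct gather_model *tlb)
{
	tlb->freed_tables = false;
	tlb->cleared_ptes = false;
	tlb->cleared_pmds = false;
}

/* Problematic path: a nested flush wipes the record of freed tables. */
static void gm_finish_with_reset(struct gather_model *tlb)
{
	if (tlb->flush_pending > 1)
		gm_reset_range(tlb);
}

/* Assumed alternative: keep the bits and force a full-mm flush instead. */
static void gm_finish_with_fullmm(struct gather_model *tlb)
{
	if (tlb->flush_pending > 1)
		tlb->fullmm = true;
}
```

    With flush_pending == 2, freed_tables == true and cleared_pmds == true,
    gm_finish_with_reset() leaves both bits false, so a flush that depends
    on them would miss the freed tables, while gm_finish_with_fullmm()
    keeps freed_tables and marks the flush as covering the whole mm.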
    
    Use a fullmm flush for the forced flush, since it yields much better
    performance on aarch64 and a non-fullmm flush doesn't yield a
    significant difference on x86.
    
    The originally proposed fix came from Jan Stancek, who did most of the
    debugging of this issue; I just wrapped everything up together.
    
    Jan's testing results:
    
    v5.2-rc2-24-gbec7550c
    --------------------------
             mean     stddev
    real    37.382   2.780
    user     1.420   0.078
    sys     54.658   1.855
    
    v5.2-rc2-24-gbec7550c + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
    ------------------------------------------------------------------------------------
             mean     stddev
    real    37.119   2.105
    user     1.548   0.087
    sys     55.698   1.357
    
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
    Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: Jan Stancek <jstancek@redhat.com>
    Reported-by: Jan Stancek <jstancek@redhat.com>
    Tested-by: Jan Stancek <jstancek@redhat.com>
    Suggested-by: Will Deacon <will.deacon@arm.com>
    Tested-by: Will Deacon <will.deacon@arm.com>
    Acked-by: Will Deacon <will.deacon@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Nick Piggin <npiggin@gmail.com>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: <stable@vger.kernel.org>	[4.20+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>