
percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing

Audra Mitchell requested to merge aubaker/centos-stream-9:rhel15605 into main

JIRA: https://issues.redhat.com/browse/RHEL-15605

This patch is a backport of the following upstream commit:
commit 3a6358c0dbe6a286a4f4504ba392a6039a9fbd12
Author: Yu Ma <yu.ma@intel.com>
Date: Fri Jun 9 23:07:30 2023 -0400

percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing  

When running the UnixBench/Execl throughput case, false sharing is observed
due to frequent reads of base_addr and writes to free_bytes and chunk_md.

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do short jobs.  It issues the execl system call
frequently, and execl calls mm_init to initialize the mm_struct of the
process.  mm_init calls __percpu_counter_init to initialize percpu
counters, which in turn calls pcpu_alloc; pcpu_alloc reads the base_addr of
a pcpu_chunk to allocate memory.  Inside pcpu_alloc, pcpu_alloc_area is
called to allocate memory from a specified chunk, and it updates
"free_bytes" and "chunk_md" to record the remaining free bytes and other
metadata for that chunk.  Correspondingly, pcpu_free_area also updates
these two members when freeing memory.
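
For reference, a simplified sketch of the fields involved, approximating the
pre-patch layout of struct pcpu_chunk in mm/percpu-internal.h
(config-dependent fields omitted, so exact offsets are illustrative only):

    struct pcpu_chunk {
            struct list_head      list;       /* linked to pcpu_slot lists */
            int                   free_bytes; /* written by pcpu_alloc_area()/pcpu_free_area() */
            struct pcpu_block_md  chunk_md;   /* written by pcpu_alloc_area()/pcpu_free_area() */
            void                 *base_addr;  /* read on every pcpu_alloc() */

            unsigned long        *alloc_map;  /* allocation map */
            unsigned long        *bound_map;  /* boundary map */
            struct pcpu_block_md *md_blocks;  /* metadata blocks */
            /* ... */
    };

With this layout, the write-hot free_bytes/chunk_md and the read-mostly
base_addr typically fall in the same 64-byte cache line, so every allocation
or free invalidates the line that concurrent pcpu_alloc() callers read
base_addr from.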

Call trace from perf is as below:  
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init  
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc  
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock  
   - 53.54% 0x654278696e552f34  
        main  
        __execve  
        entry_SYSCALL_64_after_hwframe  
        do_syscall_64  
        __x64_sys_execve  
        do_execveat_common.isra.47  
        alloc_bprm  
        mm_init  
        __percpu_counter_init  
        pcpu_alloc  
      - __mutex_lock.isra.17  

In the current pcpu_chunk layout, `base_addr' is in the same cache line as
`free_bytes' and `chunk_md', occupying the last 8 bytes of that line.  This
patch moves `bound_map' up into the slot `base_addr' used to occupy, letting
`base_addr' start a new cacheline.
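
A rough sketch of the re-layout described above, assuming base_addr is
pushed onto its own cacheline (for instance with ____cacheline_aligned_in_smp);
field names follow mm/percpu-internal.h, with config-dependent fields omitted:

    struct pcpu_chunk {
            struct list_head      list;       /* linked to pcpu_slot lists */
            int                   free_bytes; /* written on alloc/free */
            struct pcpu_block_md  chunk_md;   /* written on alloc/free */
            unsigned long        *bound_map;  /* boundary map, moved up */

            /*
             * base_addr now starts a new cacheline, so the frequent reads
             * in pcpu_alloc() no longer share a line with the write-hot
             * free_bytes/chunk_md.
             */
            void                 *base_addr ____cacheline_aligned_in_smp;

            unsigned long        *alloc_map;  /* allocation map */
            struct pcpu_block_md *md_blocks;  /* metadata blocks */
            /* ... */
    };

The trade-off is the padding inserted before base_addr, which is what the
sizing estimate below accounts for.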

With this change, on an Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160-parallel Execl score improves by 24%.

The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic.  A chunk covers roughly between
64KB and 512KB of memory depending on config and boot-time parameters, so I
believe the additional memory used here is nominal at best.

Working the #s on my desktop:  
Percpu:            58624 kB  
28 cores -> ~2.1MB of percpu memory.  
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.  
Adding alignment might bump the chunk size by ~64 bytes, so in total ~2KB
of overhead?

I believe we can do a little better to avoid eating that full padding,  
so likely less than that.  

[dennis@kernel.org: changelog details]  
Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com  
Signed-off-by: Yu Ma <yu.ma@intel.com>  
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>  
Acked-by: Dennis Zhou <dennis@kernel.org>  
Cc: Dan Williams <dan.j.williams@intel.com>  
Cc: Dave Hansen <dave.hansen@intel.com>  
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>  
Cc: Shakeel Butt <shakeelb@google.com>  
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>  

Signed-off-by: Audra Mitchell <audra@redhat.com>
