mm/memcg: Free percpu stats memory of dying memcg's

Waiman Long requested to merge llong1/centos-stream-9:bz2176388_memcg into main

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2176388
Upstream Status: RHEL-only

For systems with a large number of CPUs, the majority of the memory consumed by the mem_cgroup structure is actually the percpu stats memory. When a large number of memory cgroups are continuously created and destroyed (as on a container host), it is possible that more and more mem_cgroup structures remain in the dying state, holding up an increasing amount of percpu memory.

We can't free up the memory of the dying mem_cgroup structure due to active references mainly from pages in the page cache. However, the percpu stats memory allocated to that mem_cgroup is a different story.

There are 2 sets of percpu stat counters in the mem_cgroup structure and the associated mem_cgroup_per_node structure.

  • vmstats_percpu (struct mem_cgroup)
  • lruvec_stat_percpu (struct mem_cgroup_per_node)

There is discussion upstream about the best way to handle dying memory cgroups that hang around indefinitely, mostly due to shared memory. See https://lwn.net/Articles/932070/ for more information. It looks like a final solution may still need some more time.

This patch works around the problem by freeing the percpu stats memory associated with a dying memory cgroup. This eliminates the percpu memory increase problem, but we will still see an increase in slab memory consumption associated with the dying memory cgroups. As a workaround, it is not likely to be accepted upstream, but a lot of RHEL customers are seeing this percpu memory increase problem.

A new percpu_stats_disabled variable is added to keep track of the state of the percpu stats memory. If the variable is set, percpu stats updates are disabled for that particular memcg and forwarded to a parent memcg.
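The forwarding rule can be illustrated with a minimal user-space sketch. It is not the patch's code: struct memcg_sketch, stats_target() and mod_stat() are hypothetical names, a single long stands in for the percpu counters, and walking up to the nearest ancestor whose stats are still enabled is an assumption about how "forwarded to a parent memcg" behaves when several levels are dying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct memcg_sketch {
    struct memcg_sketch *parent;  /* NULL for the root */
    int percpu_stats_disabled;    /* nonzero once the memcg goes offline */
    long stat;                    /* stands in for the percpu counters */
};

/*
 * Pick the memcg that should receive a stat update: the memcg itself
 * if its stats are enabled, otherwise the nearest ancestor whose
 * stats are enabled. The walk stops at the root (assumption: the
 * root is never disabled).
 */
static struct memcg_sketch *stats_target(struct memcg_sketch *memcg)
{
    assert(memcg != NULL);
    while (memcg->percpu_stats_disabled && memcg->parent)
        memcg = memcg->parent;
    return memcg;
}

/* Apply a stat delta, forwarding it upward if needed. */
static void mod_stat(struct memcg_sketch *memcg, long delta)
{
    stats_target(memcg)->stat += delta;
}
```

In this sketch, an update aimed at a disabled child never touches the child's own counter, which is what makes it safe to free that memory later.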

The disabling, flushing and freeing of the percpu stats memory is a multi-step process.

The percpu_stats_disabled variable is set to MEMCG_PERCPU_STATS_DISABLED first when the memcg is being set to an offline state. At this point, the cgroup filesystem control files corresponding to the offline cgroup are being removed and will no longer be visible in user space.

After an RCU grace period, which is waited for with the help of rcu_work, no task should still be reading or updating the percpu stats. The percpu_stats_disabled variable is then atomically set to PERCPU_STATS_FLUSHING before flushing out the percpu stats and changing its state to PERCPU_STATS_FLUSHED. The percpu memory is then freed and the state is changed to PERCPU_STATS_FREED.
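The state progression described above can be sketched in user-space C11. This is only an illustration under stated assumptions: the enum names and values, struct dying_memcg, and both functions are hypothetical, the RCU grace period is assumed to have already elapsed before the flush function is called, and a plain long stands in for the percpu counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state names; the patch's identifiers and values may differ. */
enum stats_state {
    STATS_ENABLED = 0,
    STATS_DISABLED,
    STATS_FLUSHING,
    STATS_FLUSHED,
    STATS_FREED,
};

/* Hypothetical stand-in for a memcg and the parent it flushes into. */
struct dying_memcg {
    _Atomic int state;
    long pending;          /* unflushed percpu deltas */
    long parent_total;     /* where flushed deltas end up */
    bool percpu_allocated; /* models the percpu allocation */
};

/* Step 1: mark stats disabled when the memcg goes offline. */
static void memcg_offline_sketch(struct dying_memcg *m)
{
    atomic_store(&m->state, STATS_DISABLED);
}

/*
 * Steps 2-4, assumed to run after the grace period. The
 * compare-and-exchange lets exactly one caller move from DISABLED
 * to FLUSHING; everyone else backs off.
 */
static bool memcg_flush_and_free_sketch(struct dying_memcg *m)
{
    int expected = STATS_DISABLED;

    if (!atomic_compare_exchange_strong(&m->state, &expected,
                                        STATS_FLUSHING))
        return false;

    m->parent_total += m->pending;   /* flush into the parent */
    m->pending = 0;
    atomic_store(&m->state, STATS_FLUSHED);

    assert(m->percpu_allocated);
    m->percpu_allocated = false;     /* stands in for freeing percpu memory */
    atomic_store(&m->state, STATS_FREED);
    return true;
}
```

Making the DISABLED to FLUSHING step a compare-and-exchange means a second flusher, or a flush attempted before the memcg went offline, simply fails instead of double-freeing.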

This will greatly reduce the amount of memory held up by dying memory cgroups.

For the compiled RHEL9 kernel, memcg_vmstats_percpu and lruvec_stats_percpu have a size of 1080 and 672 bytes respectively. The mem_cgroup and mem_cgroup_per_node structures have a size of 2240 and 1096 bytes respectively. For a 2-socket 96-thread system, that means each dying memory cgroup uses 232,704 bytes of percpu data and 3,338 bytes of memcg slab data. The percpu/slab ratio is 69. The ratio can be even higher for larger systems with many CPUs.
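The percpu figure can be reproduced with a short calculation. The sketch below assumes that the 2-socket system has two NUMA nodes and that each CPU holds one memcg_vmstats_percpu block plus one lruvec_stats_percpu block per node; the function names are hypothetical, and the slab figure is passed in as quoted in the text rather than derived.

```c
#include <assert.h>

/* Sizes in bytes quoted in the text for the compiled RHEL9 kernel. */
enum {
    VMSTATS_PERCPU_SIZE = 1080, /* memcg_vmstats_percpu */
    LRUVEC_PERCPU_SIZE  = 672,  /* lruvec_stats_percpu */
};

/*
 * Percpu bytes held by one memcg, assuming every CPU carries one
 * vmstats block plus one lruvec block per NUMA node.
 */
static long percpu_bytes(long cpus, long nodes)
{
    assert(cpus > 0 && nodes > 0);
    return cpus * (VMSTATS_PERCPU_SIZE + nodes * LRUVEC_PERCPU_SIZE);
}

/* Integer percpu/slab ratio for a given slab footprint in bytes. */
static long percpu_to_slab_ratio(long cpus, long nodes, long slab_bytes)
{
    assert(slab_bytes > 0);
    return percpu_bytes(cpus, nodes) / slab_bytes;
}
```

With 96 CPUs and 2 nodes, percpu_bytes() gives 96 × (1080 + 2 × 672) = 232,704, and dividing by the quoted 3,338 bytes of slab data gives the ratio of 69. Because the percpu term scales with the CPU count while the slab term does not, the ratio grows on larger machines.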

By freeing the percpu memory, the dying memory cgroups will now consume much less memory than before.

This patch does introduce a bit of performance overhead when updating memcg stats, especially in __mod_memcg_lruvec_state().

This RHEL-only patch will be reverted when the upstream fix is finalized and merged into RHEL9.

Signed-off-by: Waiman Long <longman@redhat.com>
