
mm/memcg: Allow OOM eventfd notifications under PREEMPT_RT

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174178
Upstream-status: RHEL-only

Context

Per the upstream patchset:

https://lore.kernel.org/all/20220226204144.1008339-4-bigeasy@linutronix.de/T/#mfb405a56eca687c82a2cb1eb5c83ffd540c29e1a

  cgroup.event_control / memory.soft_limit_in_bytes is disabled on PREEMPT_RT. It is a deprecated v1 feature. Fixing the signal path is not worth it.

The problematic pattern is

    local_irq_disable();
    mem_cgroup_charge_statistics(memcg, nr_pages);
    memcg_check_events(memcg, folio_nid(folio));
    local_irq_enable();

mem_cgroup_charge_statistics() has been made RT-safe, but memcg_check_events() hasn't and returns immediately on PREEMPT_RT.

memcg_check_events() is problematic for several reasons:
o mem_cgroup_event_ratelimit() reads a set of percpu variables, which in this scenario requires IRQs to be disabled
o mem_cgroup_threshold() can send an eventfd_signal(), which acquires a non-raw spinlock
o mem_cgroup_update_tree() acquires a non-raw spinlock

Making these RT-compatible would require moving memcg_check_events() out of the local_irq_{disable, enable}() region, and adding finer-grained IRQ disabled regions within to protect mem_cgroup_event_ratelimit() and potentially mem_cgroup_update_tree() as well.

Furthermore, as mem_cgroup_threshold() normally runs with IRQs disabled under !PREEMPT_RT, it is not entirely clear whether running it with IRQs enabled is actually safe. Conversely, making the eventfd_ctx spinlock raw is a no-go given its relatively widespread use (~50 callsites).

A note on cgroupv2

cgroupv2 memcg doesn't have any of these issues, as events are recorded via memcg_memory_event(), which doesn't sit in an IRQ-off region. It relies on atomic increments, which don't require disabling IRQs or preemption.
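As a rough user-space analogue of why atomic counters need no IRQ-off or preempt-off section, the sketch below bumps per-event counters with C11 atomics. All names here (sketch_memcg, sketch_memory_event(), the event enum) are hypothetical and chosen for illustration only; this is not the kernel's memcg_memory_event() implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical event kinds, for illustration only. */
enum sketch_event { SKETCH_EVENT_MAX_HIT, SKETCH_EVENT_OOM, SKETCH_NR_EVENTS };

/* Hypothetical stand-in for a cgroup's event counters. */
struct sketch_memcg {
    atomic_long events[SKETCH_NR_EVENTS];
};

static void sketch_memcg_init(struct sketch_memcg *memcg)
{
    for (int i = 0; i < SKETCH_NR_EVENTS; i++)
        atomic_init(&memcg->events[i], 0);
}

/*
 * Record one event. The atomic add is a single indivisible update, so no
 * lock and no IRQ-off or preempt-off section is needed around it.
 */
static void sketch_memory_event(struct sketch_memcg *memcg, enum sketch_event ev)
{
    assert(ev < SKETCH_NR_EVENTS);
    atomic_fetch_add_explicit(&memcg->events[ev], 1, memory_order_relaxed);
}

/* Read back the current count for one event kind. */
static long sketch_read_event(struct sketch_memcg *memcg, enum sketch_event ev)
{
    assert(ev < SKETCH_NR_EVENTS);
    return atomic_load_explicit(&memcg->events[ev], memory_order_relaxed);
}
```

Contrast this with the cgroupv1 threshold path, where the percpu reads and the non-raw spinlocks are what tie the code to an IRQ-disabled region.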

Changes

Threshold events signaled via memcg_check_events() are problematic for PREEMPT_RT, but OOM events are different: they happen via

    try_charge_memcg()
      mem_cgroup_oom()

and don't involve per-CPU stats or IRQ/preemption disabled regions. Those are thus safe for PREEMPT_RT - re-enable them.

This change is effectively a partial revert of upstream commit

2343e88d238f ("mm/memcg: disable threshold event handlers on PREEMPT_RT")

in that it re-enables only OOM eventfd notifications under PREEMPT_RT.

This remains RHEL-only as cgroupv1 is in life support mode upstream, and cgroupv1 memcg is clearly marked as deprecated, cf:

3bc942f3 ("memcg: rename cgroup_event to mem_cgroup_event")

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
