mm/memcg: Allow OOM eventfd notifications under PREEMPT_RT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2174178
Upstream-status: RHEL-only
Context
Per the upstream patchset:
  https://lore.kernel.org/all/20220226204144.1008339-4-bigeasy@linutronix.de/T/#mfb405a56eca687c82a2cb1eb5c83ffd540c29e1a
  cgroup.event_control / memory.soft_limit_in_bytes is disabled on PREEMPT_RT. It is a deprecated v1 feature. Fixing the signal path is not worth it.
The problematic pattern is:
  local_irq_disable();
  mem_cgroup_charge_statistics(memcg, nr_pages);
  memcg_check_events(memcg, folio_nid(folio));
  local_irq_enable();
mem_cgroup_charge_statistics() has been made RT-safe, but memcg_check_events() hasn't, and it returns immediately on PREEMPT_RT.
memcg_check_events() is problematic for several reasons:
 o mem_cgroup_event_ratelimit() reads a set of per-CPU variables, which in this scenario requires IRQs to be disabled
 o mem_cgroup_threshold() can call eventfd_signal(), which acquires a non-raw spinlock
 o mem_cgroup_update_tree() acquires a non-raw spinlock
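To make the first point concrete, here is a userspace model of the ratelimit check: it reads two per-CPU values and conditionally writes one back, so nothing else may touch those counters on the same CPU in between. The struct, enum and step sizes are assumptions loosely modelled on upstream memcontrol.c, not the kernel's actual definitions; a plain struct stands in for one CPU's per-CPU data.

```c
#include <stdbool.h>

/* Hypothetical event targets and step sizes (assumed, not kernel definitions). */
enum event_target { TARGET_THRESH, TARGET_SOFTLIMIT, NR_TARGETS };

#define THRESH_EVENTS_STEP     128UL
#define SOFTLIMIT_EVENTS_STEP 1024UL

/* Stand-in for one CPU's slice of the per-CPU memcg event data. */
struct cpu_events {
	unsigned long nr_page_events;      /* bumped on every charge/uncharge */
	unsigned long targets[NR_TARGETS]; /* value at which the next check fires */
};

/*
 * Read-modify-write of two per-CPU values. In the kernel an IRQ that
 * charges memory on this CPU between the reads and the write could
 * update the same counters, hence the IRQ-disabled caller.
 */
static bool event_ratelimit(struct cpu_events *pcpu, enum event_target t)
{
	unsigned long val = pcpu->nr_page_events;
	unsigned long next = pcpu->targets[t];

	/* Wrap-safe "val has moved past next" comparison. */
	if ((long)(next - val) < 0) {
		next = val + (t == TARGET_THRESH ? THRESH_EVENTS_STEP
						 : SOFTLIMIT_EVENTS_STEP);
		pcpu->targets[t] = next;
		return true;
	}
	return false;
}
```

With one event per charge, the threshold check fires on the first event and then once every 129 events, so the expensive threshold scan is amortised.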
Making these RT-compatible would require moving memcg_check_events() out of the local_irq_{disable, enable}() region and adding finer-grained IRQ-disabled regions within it to protect mem_cgroup_event_ratelimit(), and potentially mem_cgroup_update_tree() as well.
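The restructuring described above could look roughly like the following userspace mock, where IRQ state is a plain flag and each stub asserts the context it needs. Every name here is hypothetical; it only illustrates the nesting (per-CPU work under a short IRQ-off region, sleeping-lock work with IRQs enabled) and does not settle whether running the threshold code with IRQs enabled is safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock state: whether "IRQs" are disabled, and signals sent. */
static bool irqs_off;
static int signals_sent;

static void mock_local_irq_disable(void) { assert(!irqs_off); irqs_off = true; }
static void mock_local_irq_enable(void)  { assert(irqs_off);  irqs_off = false; }

/* Per-CPU stats update: must run with IRQs disabled. */
static void mock_charge_statistics(void) { assert(irqs_off); }

/* Per-CPU ratelimit check: must run with IRQs disabled; always fires here. */
static bool mock_event_ratelimit(void) { assert(irqs_off); return true; }

/* On PREEMPT_RT a non-raw spinlock may sleep, so IRQs must be enabled. */
static void mock_eventfd_signal(void) { assert(!irqs_off); signals_sent++; }

/* Event check moved out of the caller's IRQ-off region. */
static void mock_check_events(void)
{
	bool fire;

	/* Finer-grained IRQ-off region covering only the per-CPU accesses. */
	mock_local_irq_disable();
	fire = mock_event_ratelimit();
	mock_local_irq_enable();

	if (fire)
		mock_eventfd_signal();
}

static void mock_charge(void)
{
	mock_local_irq_disable();
	mock_charge_statistics();
	mock_local_irq_enable();
	mock_check_events();	/* no longer nested inside the IRQ-off region */
}
```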
Furthermore, as mem_cgroup_threshold() normally runs with IRQs disabled under !PREEMPT_RT, it is not entirely clear whether running it with IRQs enabled is actually safe. Conversely, making the eventfd_ctx spinlock raw is a no-go given its relatively widespread use (~50 call sites).
A note on cgroupv2
cgroupv2 memcg doesn't have any of these issues, as events are recorded via memcg_memory_event(), which doesn't sit in an IRQ-off region. It leverages atomic increments, which don't require disabling IRQs or preemption.
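For contrast, a userspace sketch of atomic event counting: concurrent updaters each bump an atomic counter, with no IRQ or preemption disabling and no lock. It uses C11 <stdatomic.h> and POSIX threads; the struct, event names and thread counts are hypothetical, and it only models the atomic increment, not everything memcg_memory_event() does.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical event types for one cgroup. */
enum memcg_evt { EVT_HIGH, EVT_MAX, EVT_OOM, NR_EVTS };

struct cg_events {
	atomic_long counts[NR_EVTS];
};

/* A single atomic RMW: safe from any context without extra protection. */
static void record_event(struct cg_events *cg, enum memcg_evt e)
{
	atomic_fetch_add_explicit(&cg->counts[e], 1, memory_order_relaxed);
}

#define NR_THREADS 4
#define EVENTS_PER_THREAD 10000

static struct cg_events shared;	/* static storage: zero-initialised */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < EVENTS_PER_THREAD; i++)
		record_event(&shared, EVT_OOM);
	return NULL;
}

/* Runs NR_THREADS concurrent writers and returns the final OOM count. */
static long run_concurrent_events(void)
{
	pthread_t t[NR_THREADS];

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&shared.counts[EVT_OOM]);
}
```

No increment is lost even though nothing serialises the writers, which is why this style of accounting sidesteps the IRQ-off requirement.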
Changes
Threshold events signaled via memcg_check_events() are problematic for PREEMPT_RT, but OOM events are different: they happen via
  try_charge_memcg()
  `\
    mem_cgroup_oom()
and don't involve per-CPU stats or IRQ/preemption-disabled regions. They are thus safe for PREEMPT_RT, so re-enable them.
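From userspace, the re-enabled interface is the cgroup v1 OOM notifier: create an eventfd, open the cgroup's memory.oom_control, and write "<event_fd> <fd of memory.oom_control>" to cgroup.event_control, as described in the cgroup v1 memory controller documentation. The sketch below follows that recipe; the helper names are hypothetical, the cgroup directory is supplied by the caller, and it assumes the kernel only needs the control-file descriptors for the duration of the write, so it closes them afterwards.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Format the "<event_fd> <control_fd>" line for cgroup.event_control. */
static int format_event_control(char *buf, size_t len, int efd, int ofd)
{
	return snprintf(buf, len, "%d %d", efd, ofd);
}

/* Register for OOM events on the v1 memory cgroup at @cgdir; -1 on failure. */
static int register_oom_eventfd(const char *cgdir)
{
	char path[4096], line[64];
	int efd, ofd, cfd, n, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgdir);
	ofd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
	cfd = open(path, O_WRONLY);

	if (ofd >= 0 && cfd >= 0) {
		n = format_event_control(line, sizeof(line), efd, ofd);
		/* Expected to fail on a PREEMPT_RT kernel without this change. */
		if (n > 0 && write(cfd, line, (size_t)n) == n)
			ret = efd;
	}

	if (ofd >= 0)
		close(ofd);
	if (cfd >= 0)
		close(cfd);
	if (ret < 0)
		close(efd);
	return ret;
}

/* Block until OOM events arrive; returns how many were signalled. */
static uint64_t wait_oom_event(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```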
This is effectively a partial revert of upstream commit
  2343e88d238f ("mm/memcg: disable threshold event handlers on PREEMPT_RT")
that re-enables only OOM eventfd notifications under PREEMPT_RT.
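The resulting policy can be modelled as a small check on the control-file name. This is a hypothetical sketch, not the actual patch: the function name is invented, and it assumes every non-OOM event type (including memory.pressure_level) stays rejected on PREEMPT_RT, with unknown names rejected as invalid.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model: which event-control registrations are accepted. */
static int check_event_control(const char *name, bool preempt_rt)
{
	/* OOM path: no per-CPU stats, no IRQ-off region; always allowed. */
	if (!strcmp(name, "memory.oom_control"))
		return 0;

	/* Remaining event types stay disabled on PREEMPT_RT (assumption). */
	if (!strcmp(name, "memory.usage_in_bytes") ||
	    !strcmp(name, "memory.memsw.usage_in_bytes") ||
	    !strcmp(name, "memory.pressure_level"))
		return preempt_rt ? -EOPNOTSUPP : 0;

	/* Not an event-capable control file. */
	return -EINVAL;
}
```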
This remains RHEL-only as cgroupv1 is in life-support mode upstream, and cgroupv1 memcg is clearly marked as deprecated, cf:
  3bc942f3 ("memcg: rename cgroup_event to mem_cgroup_event")
Signed-off-by: Valentin Schneider <vschneid@redhat.com>