sched/fair: Make the BW replenish timer expire in hardirq context for PREEMPT_RT

JIRA: https://issues.redhat.com/browse/RHEL-7232
Upstream Status: RHEL-Only

RHEL-Only considerations: This doesn't fly upstream. An alternative fix is in the works (throttling only on kernel exit), but it's going to take weeks if not months to get that in a mergeable state.

This change can be easily reverted once a better option is available, and it won't hinder backports too much. The stock (non-RT) kernel is unaffected, as its default timer expiry mode is hardirq.

Consider the following scenario under PREEMPT_RT:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued on the same CPU as one where ktimers/ksoftirqd is blocked on read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and softirq contexts.

The linux-rt tree has had 1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.") which helped this scenario for non-rwlock locks by ensuring the throttled task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately, rwlocks cannot sanely do PI as they allow multiple readers.

Make the period_timer expire in hardirq context under PREEMPT_RT.
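
The circular dependency and the effect of the fix can be sketched as a wait-for chain. This is a userspace toy model, not kernel code: the node names, `build_graph()` and `has_circular_wait()` are hypothetical and exist only for illustration. It assumes each actor waits on exactly one other actor, which holds for the scenario described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Actors in the scenario; illustrative names, not kernel identifiers. */
enum node { P0_READER, P1_WRITER, KTIMERS, PERIOD_TIMER, NR_NODES };

/* waits_on[a] is the one node that a cannot progress without, or -1. */
static void build_graph(int waits_on[NR_NODES], bool timer_in_hardirq)
{
	/* p0 stays throttled until the period timer replenishes its runtime. */
	waits_on[P0_READER] = PERIOD_TIMER;
	/* The writer waits for the throttled reader to release the lock. */
	waits_on[P1_WRITER] = P0_READER;
	/* ktimers took the reader slowpath and queues behind the writer. */
	waits_on[KTIMERS] = P1_WRITER;
	/* Softirq expiry needs ktimers to run; hardirq expiry needs no task. */
	waits_on[PERIOD_TIMER] = timer_in_hardirq ? -1 : KTIMERS;
}

/* Follow the wait chain from p0; true if it never reaches a free node. */
bool has_circular_wait(bool timer_in_hardirq)
{
	int waits_on[NR_NODES];
	int cur = P0_READER;
	int steps;

	build_graph(waits_on, timer_in_hardirq);
	for (steps = 0; steps < NR_NODES; steps++) {
		cur = waits_on[cur];
		if (cur < 0)
			return false;	/* chain ends: someone can make progress */
		assert(cur < NR_NODES);
		if (cur == P0_READER)
			return true;	/* back at p0: circular dependency */
	}
	return true;	/* more hops than nodes implies a cycle */
}
```

In this model, `has_circular_wait(false)` follows p0 -> period_timer -> ktimers -> p1 -> p0 and reports a cycle, while `has_circular_wait(true)` stops at the timer because hardirq expiry does not depend on any blocked task.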

Link: https://lore.kernel.org/all/20231030145104.4107573-1-vschneid@redhat.com/
Signed-off-by: Valentin Schneider <vschneid@redhat.com>

Edited by Valentin Schneider
