mm/mglru: Revert "don't sync disk for each aging cycle"
JIRA: https://issues.redhat.com/browse/RHEL-43371
Upstream Status: RHEL only
Since the 9.4 mm update to upstream v6.1, premature OOM kills have been observed much more frequently for tasks in a memory-constrained cgroup. As shown in the Jira ticket, one easy way to reproduce the problem is to write a large amount of random data to an NFS-mounted filesystem.
Bisection identified the culprit as commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle"). Upstream also had some discussion of this premature OOM problem in [1]. The purpose of that commit is to prevent SSD wearout, since flushing on every aging cycle may breach the writeback rate limit a system wants to impose.
However, this causes a serious problem in the OCP environment, where most containers are under the control of a memory cgroup. Premature OOM kills greatly reduce OCP's reliability and stability. Revert this problematic commit for now so that wakeup_flusher_threads() is called again on every generation bump. This revert can be dropped once upstream comes up with a better fix.
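In effect, the revert restores a call along the following lines in the MGLRU aging path in mm/vmscan.c (a sketch of the reverted hunk, not the verbatim diff; the exact placement is as in commit 14aa8b2d5c2e):

	/*
	 * Kick the flusher threads on each generation bump so that
	 * dirty pages are written back before reclaim runs out of
	 * clean pages and triggers a premature OOM kill.
	 */
	wakeup_flusher_threads(WB_REASON_VMSCAN);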
Before this patch, the reproducer shown in the Jira ticket runs once successfully but gets OOM-killed on the second run. The write data rate on a certain test system was:
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 57.5474 s, 37.3 MB/s
After applying this patch, the reproducer can be run multiple times without triggering an OOM kill. The new write data rate was:
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 25.694 s, 83.6 MB/s
The write throughput more than doubled.
By disabling MGLRU (CONFIG_LRU_GEN=n), the write data rate was:
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 21.184 s, 101 MB/s
This is better still, so some improvement may yet be needed in the MGLRU code to match the non-MGLRU performance.
[1] https://lore.kernel.org/lkml/ZcWOh9u3uqZjNFMa@chrisdown.name/
Signed-off-by: Waiman Long <longman@redhat.com>