Can we use Sidekiq Memory Killer to track workers that cause OOM Kills
The Sidekiq MemoryKiller documentation suggests that it is enabled by default only in Omnibus packages for self-managed instances, but it appears to also be enabled for urgent workloads on Kubernetes in gprd.
It would be really useful if we could track which job was being executed when the Sidekiq MemoryKiller stepped in. This would establish a direct relationship between jobs with high memory use and memory killer events.
This will be easier to do with an in-application memory killer than with the Linux OOM killer.
The idea is to improve the Sidekiq memory killer, perhaps by introducing an additional soft limit that serves purely as a "recording mode". This could help us understand what is going on, tune the soft limits better, and correlate memory killer events with the jobs that were running when the soft limit was reached. It would also help us decide in which direction to further evolve the Sidekiq memory killer and make it smarter.
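As a rough illustration of the "recording mode" idea, here is a minimal sketch of a Sidekiq server middleware that only logs when process RSS crosses a soft limit, without killing anything. `SoftLimitRecorder` and the `SIDEKIQ_MEMORY_SOFT_LIMIT_KB` variable are hypothetical names for this sketch, not existing GitLab code:

```ruby
# frozen_string_literal: true

require 'sidekiq'

# Hypothetical "recording mode" middleware: it never kills the process,
# it only records which job was running when RSS exceeded a soft limit,
# giving us job <-> memory correlation data to tune real limits against.
class SoftLimitRecorder
  SOFT_LIMIT_KB = Integer(ENV.fetch('SIDEKIQ_MEMORY_SOFT_LIMIT_KB', '2000000'))

  def call(worker, job, queue)
    yield
  ensure
    # Sample RSS after the job (even if it raised), so spikes caused by
    # the job itself are attributed to it.
    rss_kb = current_rss_kb
    if rss_kb > SOFT_LIMIT_KB
      Sidekiq.logger.warn(
        message: 'memory soft limit exceeded',
        worker: worker.class.name,
        jid: job['jid'],
        queue: queue,
        rss_kb: rss_kb,
        soft_limit_kb: SOFT_LIMIT_KB
      )
    end
  end

  private

  # Read resident set size from /proc (Linux-only), similar to how the
  # existing MemoryKiller samples RSS.
  def current_rss_kb
    File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add SoftLimitRecorder
  end
end
```

Because this limit never terminates anything, it could safely be set well below the hard kill threshold to collect data on which workers routinely approach it.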
As @mkaeppler suggested, it would be nice if we could:
> For instance, I would like to get away from duplicating memory limits between the Rails monolith and our execution environments (here, Kubernetes). Instead, we could always let it run up to, say, 90% of container memory before we issue a kill. This can be computed dynamically at runtime instead of via static configuration with a figure in bytes.
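A minimal sketch of what that dynamic computation could look like, assuming we read the container limit from the cgroup filesystem; the `ContainerMemory` module name and the 90% factor are illustrative assumptions, not a settled design:

```ruby
# frozen_string_literal: true

# Sketch: derive the kill threshold from the container's cgroup memory
# limit at runtime, rather than duplicating a static byte figure between
# the Rails monolith and the Kubernetes deployment.
module ContainerMemory
  CGROUP_V2_MAX   = '/sys/fs/cgroup/memory.max'
  CGROUP_V1_LIMIT = '/sys/fs/cgroup/memory/memory.limit_in_bytes'

  # Container memory limit in bytes, or nil when unlimited or when not
  # running inside a cgroup-constrained environment.
  def self.limit_bytes
    raw =
      if File.exist?(CGROUP_V2_MAX)
        File.read(CGROUP_V2_MAX).strip
      elsif File.exist?(CGROUP_V1_LIMIT)
        File.read(CGROUP_V1_LIMIT).strip
      end
    return nil if raw.nil? || raw == 'max' # cgroup v2 reports 'max' when unlimited

    bytes = raw.to_i
    # cgroup v1 reports an enormous value when no limit is set; treat it as nil.
    bytes >= (1 << 60) ? nil : bytes
  end

  # Kill threshold computed as a fraction (default 90%) of the container limit.
  def self.kill_threshold_bytes(fraction: 0.9)
    limit = limit_bytes
    limit && (limit * fraction).to_i
  end
end

threshold = ContainerMemory.kill_threshold_bytes
puts threshold ? "kill at #{threshold} bytes" : 'no container limit detected'
```

With something like this, resizing the pod's memory request/limit in Kubernetes would automatically move the kill threshold, with no Rails configuration change required.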