Memory watchdog should restart high-memory workers
We found in #365950 (closed) that reaping workers purely on high heap fragmentation is useful only for certain parts of our production fleet.
To replace puma-worker-killer, I think we should look at other metrics that indicate "bad behavior". We should probably not use absolute RSS, and certainly not a fixed budget that we need to maintain (fixed budgets have caused both confusion and extra work in the past).
Some ideas:
- Reap workers based on relative growth. For instance, we could capture master RSS prior to forking. If the watchdog observes a worker exceeding some multiple of master RSS, it issues a kill. This still places a cap on RSS, but the budget is relative and scales over time.
- Reap workers based on suspected memory leaks. We have instances where we allocate millions of objects that are never freed. This happens in response to requests, not during application start. We could have the watchdog observe live slot growth, and restart workers if this grows steadily over a period of time.
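The relative-growth idea could look roughly like the sketch below. This is not an actual implementation; the class name, the `max_multiple` threshold, and reading RSS from `/proc/<pid>/statm` (Linux only) are all illustrative assumptions.

```ruby
# Hypothetical sketch: compare a worker's RSS against a multiple of
# the master's RSS captured before forking. Names/thresholds are
# illustrative, not GitLab's actual watchdog.
class RssWatchdog
  PAGE_SIZE = 4096 # bytes; typical Linux page size (assumption)

  # master_rss_bytes would be captured in the Puma master before fork.
  def initialize(master_rss_bytes, max_multiple: 3.0)
    @master_rss = master_rss_bytes
    @max_multiple = max_multiple
  end

  # Current RSS of this process, read from /proc (Linux only).
  # The second field of /proc/<pid>/statm is resident pages.
  def current_rss_bytes(pid = Process.pid)
    File.read("/proc/#{pid}/statm").split[1].to_i * PAGE_SIZE
  end

  # True once the worker exceeds the relative budget; the watchdog
  # loop would then issue a graceful kill (e.g. SIGTERM) to the worker.
  def over_budget?
    current_rss_bytes > @master_rss * @max_multiple
  end
end
```

Because the budget is derived from master RSS at boot, it grows with the application over time instead of requiring manual maintenance.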
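The live-slot heuristic could be sketched as a simple strike counter over `GC.stat(:heap_live_slots)` samples: if live slots grow for N consecutive observation windows, we suspect a leak. The class name and strike threshold below are assumptions for illustration.

```ruby
# Hypothetical sketch: flag a worker whose GC live slots grow for
# max_strikes consecutive samples. Threshold is an assumption.
class LiveSlotMonitor
  def initialize(max_strikes: 5)
    @max_strikes = max_strikes
    @strikes = 0
    @last_slots = nil
  end

  # Feed one observation (defaults to the current live-slot count).
  # Returns true once growth has been seen max_strikes times in a row,
  # at which point the watchdog would restart the worker.
  def sample(live_slots = GC.stat(:heap_live_slots))
    grew = @last_slots && live_slots > @last_slots
    @strikes = grew ? @strikes + 1 : 0
    @last_slots = live_slots
    @strikes >= @max_strikes
  end
end
```

Requiring consecutive growth (rather than a single spike) filters out ordinary request-driven allocation that the GC later reclaims, which is what distinguishes a suspected leak from normal churn.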
Edited by Matthias Käppler