Investigate Puma long-term memory use
Problem to solve
We have noticed before that Puma memory use climbs steadily in production over a period of days, especially over weekends, when there are no deploys that would reset these gauges. Meanwhile, we are seeing an increasing number of customer reports that the Puma memory killer frequently kills Puma processes, which causes throughput issues.
We should try to find out the reason for this potential runaway memory growth, whether it is just Ruby heap fragmentation or an actual memory leak.
The issue does not seem as pronounced on Sidekiq, where (across all fleets) memory seems to plateau at some point.
Proposal
A short-term measure would be to tune puma-memory-killer to recycle workers more frequently. However, that is just a band-aid; we should try to understand the root cause of this.
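For reference, a minimal sketch of what that tuning could look like, assuming the `puma_worker_killer` gem; the knobs and values below are illustrative, not our current production settings:

```ruby
# config/puma.rb -- hypothetical values, assuming the puma_worker_killer gem
before_fork do
  require 'puma_worker_killer'

  PumaWorkerKiller.config do |config|
    config.ram           = 4096 # total RAM budget for the Puma cluster, in MB
    config.frequency     = 20   # check worker memory every 20 seconds
    config.percent_usage = 0.98 # reap the largest worker once usage exceeds 98% of the budget
    # Recycle every worker on a fixed schedule regardless of memory use
    config.rolling_restart_frequency = 12 * 3600 # 12 hours, in seconds
  end
  PumaWorkerKiller.start
end
```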
An idea for how to approach this:
- Ensure there are no deploys during a set time-frame (e.g. the weekend)
- Identify a specific pod to target and obtain worker heap dumps (see the capture sketch below)
- Wait for time period to pass
- Take a second heap dump
- If possible, quarantine the pod
- Compare/analyze heap dumps for memory growth (a rough comparison script is sketched at the end)
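To obtain a heap dump from a live worker, something along these lines should work; this is only a sketch, with placeholder paths, and it assumes rbtrace is loaded in the worker. Note that file/line allocation data only appears in the dump if allocation tracing was enabled when the objects were created, so for per-call-site data we would need tracing turned on at boot.

```ruby
# Ruby snippet to inject into a running worker via rbtrace, e.g.:
#   rbtrace -p <worker_pid> -e '<this snippet, collapsed onto one line>'
# The output path is a placeholder.
Thread.new do
  require 'objspace'
  GC.start
  # Dump every live object (class, size, references) as one JSON object per line
  ObjectSpace.dump_all(output: File.open('/tmp/heap-before.ndjson', 'w'))
end.join
```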
We will need to work with an SRE to do this, since pulling heap dumps requires rbtrace, which requires access at the machine and process level.
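Once we have a dump from before and after the quiet period, a rough way to compare them is to aggregate retained bytes per object type (or per allocation site, when tracing was enabled) and look at the largest deltas. The script below is only a sketch of that idea; existing tools such as the heapy gem do a more thorough job of the same analysis.

```ruby
# Sketch: diff two heap dumps produced by ObjectSpace.dump_all.
# Usage (paths are placeholders): ruby heap_diff.rb /tmp/heap-before.ndjson /tmp/heap-after.ndjson
require 'json'

# Sum retained bytes per allocation site (file:line) when available,
# falling back to the object type otherwise.
def bytes_by_key(path)
  totals = Hash.new(0)
  File.foreach(path) do |line|
    obj = JSON.parse(line)
    key = obj['file'] ? "#{obj['file']}:#{obj['line']}" : obj['type']
    totals[key] += obj['memsize'].to_i
  end
  totals
end

before = bytes_by_key(ARGV[0])
after  = bytes_by_key(ARGV[1])

# Print the 20 keys that grew the most between the two dumps
growth = after.map { |key, bytes| [key, bytes - before.fetch(key, 0)] }
growth.sort_by { |_, delta| -delta }.first(20).each do |key, delta|
  printf("%+12d bytes  %s\n", delta, key)
end
```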