Investigate Puma long-term memory use
Problem to solve
We have noticed before that Puma memory use climbs steadily in production over a period of days, especially over weekends, when there are no deploys that would reset these gauges. Meanwhile, we are seeing an increasing number of customer reports that the Puma memory killer frequently kills Puma processes, which causes throughput issues.
We should try to find out the reason for this potential runaway memory growth, whether it is just Ruby heap fragmentation or an actual memory leak.
The issue does not seem as pronounced on Sidekiq, where (across all fleets) memory seems to plateau at some point.
Proposal
A short-term measure would be to tune puma-memory-killer to recycle workers more frequently. However, that is just a band-aid; we should try to understand the root cause of this.
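For reference, a minimal sketch of what that tuning could look like, assuming the `puma_worker_killer` gem; the knobs and values below are illustrative, not our current production settings:

```ruby
# config/puma.rb -- hypothetical values, assuming the puma_worker_killer gem
before_fork do
  require 'puma_worker_killer'

  PumaWorkerKiller.config do |config|
    config.ram           = 4096 # total RAM budget for the Puma cluster, in MB
    config.frequency     = 20   # check worker memory every 20 seconds
    config.percent_usage = 0.98 # reap the largest worker once usage exceeds 98% of the budget
    # Recycle every worker on a fixed schedule regardless of memory use
    config.rolling_restart_frequency = 12 * 3600 # 12 hours, in seconds
  end
  PumaWorkerKiller.start
end
```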
An idea for how to approach this:
- Ensure there are no deploys during a set time-frame (e.g. the weekend)
- Identify a specific pod to target and obtain worker heap dumps (see the capture sketch below)
- Wait for time period to pass
- Take a second heap dump
- If possible, quarantine the pod
- Compare/analyze heap dumps for memory growth (a rough comparison script is sketched at the end)
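To obtain a heap dump from a live worker, something along these lines should work; this is only a sketch, with placeholder paths, and it assumes rbtrace is loaded in the worker. Note that file/line allocation data only appears in the dump if allocation tracing was enabled when the objects were created, so for per-call-site data we would need tracing turned on at boot.

```ruby
# Ruby snippet to inject into a running worker via rbtrace, e.g.:
#   rbtrace -p <worker_pid> -e '<this snippet, collapsed onto one line>'
# The output path is a placeholder.
Thread.new do
  require 'objspace'
  GC.start
  # Dump every live object (class, size, references) as one JSON object per line
  ObjectSpace.dump_all(output: File.open('/tmp/heap-before.ndjson', 'w'))
end.join
```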
We will need to work with an SRE to do this, since pulling heap dumps requires rbtrace, which requires access at the machine and process level.
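Once we have a dump from before and after the quiet period, a rough way to compare them is to aggregate retained bytes per object type (or per allocation site, when tracing was enabled) and look at the largest deltas. The script below is only a sketch of that idea; existing tools such as the heapy gem do a more thorough job of the same analysis.

```ruby
# Sketch: diff two heap dumps produced by ObjectSpace.dump_all.
# Usage (paths are placeholders): ruby heap_diff.rb /tmp/heap-before.ndjson /tmp/heap-after.ndjson
require 'json'

# Sum retained bytes per allocation site (file:line) when available,
# falling back to the object type otherwise.
def bytes_by_key(path)
  totals = Hash.new(0)
  File.foreach(path) do |line|
    obj = JSON.parse(line)
    key = obj['file'] ? "#{obj['file']}:#{obj['line']}" : obj['type']
    totals[key] += obj['memsize'].to_i
  end
  totals
end

before = bytes_by_key(ARGV[0])
after  = bytes_by_key(ARGV[1])

# Print the 20 keys that grew the most between the two dumps
growth = after.map { |key, bytes| [key, bytes - before.fetch(key, 0)] }
growth.sort_by { |_, delta| -delta }.first(20).each do |key, delta|
  printf("%+12d bytes  %s\n", delta, key)
end
```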