Add ability to collect Ruby heap dumps
We found that restarting workers based on high heap fragmentation is only effective on API nodes. Moreover, we are seeing what appear to be substantial memory leaks in web nodes.
We cannot diagnose these without pulling heap dumps. Since it is difficult to time this even with help from an SRE (it would have to happen before the process dies, likely during a weekend), I think we should build heap dump collection straight into the application.
My proposal is to:
- MR1: Add a new life-cycle hook `on_worker_stop` that is called when a Puma or Sidekiq worker is about to shut down (!103372 (merged))
- MR2: Wiring: Leverage `memory-watchdog` to signal the worker that it should dump `ObjectSpace` before shutting down (!103957 (merged)). This does not yet write heap dumps.
- MR3: Refactor - extract shared logic from `ReportsDaemon` into a new `Reporter` class: !104264 (merged)
- MR4: Refactor - extract shared logic from `Jemalloc` report into `Reporter`: !104727 (merged)
- MR5: Add `gzip` support to `Reporter` file streaming logic: !105115 (merged)
- MR6: Implement `HeapDump` report method to produce an object space dump: !106406 (merged)
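
For illustration only, here is a minimal sketch of what producing a gzip-compressed object space dump could look like in plain Ruby. It is not the code from the MRs above: the helper name `write_gzipped_heap_dump` and the file naming scheme are hypothetical, and the sketch writes the raw dump to a temporary file first and compresses it afterwards, which may differ from how `Reporter` streams data.

```ruby
require 'objspace'
require 'zlib'

# Hypothetical helper (not part of the GitLab codebase): dumps all live
# objects as newline-delimited JSON, gzips the result into `dir`, and
# returns the path of the compressed file.
def write_gzipped_heap_dump(dir)
  raw_path = File.join(dir, "heap_dump.#{Process.pid}.json")
  gz_path = "#{raw_path}.gz"

  # ObjectSpace.dump_all accepts an IO object as its output target.
  File.open(raw_path, 'w') do |file|
    ObjectSpace.dump_all(output: file)
  end

  # Stream the raw dump through gzip so it is never held in memory at once.
  Zlib::GzipWriter.open(gz_path) do |gz|
    File.open(raw_path, 'rb') do |raw|
      IO.copy_stream(raw, gz)
    end
  end

  gz_path
ensure
  # Remove the uncompressed intermediate file, even if compression failed.
  File.delete(raw_path) if raw_path && File.exist?(raw_path)
end
```

In the flow proposed above, something like this would run from the `on_worker_stop` hook once the watchdog has decided the worker should dump its heap before shutting down.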
The uploader will then pick up the resulting heap dump file and upload it to GCS (this was done in #362902 (closed)).
Edited by Matthias Käppler