Add ability to collect Ruby heap dumps

We found that restarting workers based on high heap fragmentation is only effective on api nodes. Moreover, we are seeing what appear to be substantial memory leaks on web nodes.

We cannot diagnose these leaks without pulling heap dumps. Timing a manual dump is difficult even with help from an SRE, because it has to happen before the process dies, which is likely to be during a weekend. I therefore think we should build heap dump collection straight into the application.

My proposal is to:

  • MR1: Add a new life-cycle hook on_worker_stop that is called when a Puma or Sidekiq worker is about to shut down (!103372 (merged))
  • MR2: Wiring: Leverage memory-watchdog to signal the worker that it should dump ObjectSpace before shutting down (!103957 (merged)). This does not yet write heap dumps.
  • MR3: Refactor - extract shared logic from ReportsDaemon into a new Reporter class (!104264 (merged))
  • MR4: Refactor - extract shared logic from Jemalloc report into Reporter (!104727 (merged))
  • MR5: Add gzip support to Reporter file streaming logic (!105115 (merged))
  • MR6: Implement HeapDump report method to produce an object space dump (!106406 (merged))

The uploader will then pick up the resulting heap dump file and put it into GCS (this was done in #362902 (closed)).
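
To illustrate what MR5 and MR6 amount to, here is a minimal sketch of writing a gzipped object space dump using only the Ruby standard library. It is not the code from those MRs: the helper name `write_gzipped_heap_dump` and the file naming scheme are assumptions made for this example. Only `ObjectSpace.dump_all`, `Zlib::GzipWriter` and `IO.copy_stream` are real stdlib APIs. The dump is written to a plain file first because `ObjectSpace.dump_all` needs a real IO object as its `output:`.

```ruby
require 'objspace'
require 'zlib'

# Hypothetical helper (not the actual Reporter/HeapDump code): dumps all live
# objects as JSON lines into `dir`, compresses the result, and returns the
# path of the .gz file.
def write_gzipped_heap_dump(dir)
  # Assumed naming scheme: one file per process, keyed by PID.
  raw_path = File.join(dir, "heap_dump.#{Process.pid}.json")
  gz_path = "#{raw_path}.gz"

  # ObjectSpace.dump_all writes one JSON document per line to the given IO.
  File.open(raw_path, 'w') do |raw|
    ObjectSpace.dump_all(output: raw)
  end

  # Stream the raw dump through gzip so it is never held in memory as a whole.
  Zlib::GzipWriter.open(gz_path) do |gz|
    File.open(raw_path, 'rb') { |raw| IO.copy_stream(raw, gz) }
  end

  # Remove the uncompressed file; only the .gz is left behind.
  File.delete(raw_path)
  gz_path
end
```

In a setup like the one proposed above, a helper of this kind would be called from the on_worker_stop hook once memory-watchdog has signalled the worker, writing into whichever directory the uploader watches.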

Edited by Matthias Käppler