
Add plumbing for dumping heap on worker shutdown

Matthias Käppler requested to merge 370077-heap-dump-plumbing into master

What does this MR do and why?

This is part of #370077 (closed)

We are adding functionality to dump Ruby's object space when killing Puma or Sidekiq workers due to high memory use. These heap dumps will then be picked up by our diagnostics report uploader and stored in GCS. We are splitting this work into three increments:

  • Add an on_worker_stop life-cycle hook: !103372 (merged)
  • This MR: Add plumbing that connects the memory-watchdog violation handlers to the shutdown hook and calls into an empty HeapDump module. It currently does nothing except emit a log event, so that we can actually see all the wires connect properly when deploying this.
  • Next: implement the HeapDump module to write a compressed ObjectSpace dump (a rough illustration follows this list).
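
For context, a compressed ObjectSpace dump amounts to roughly the following in plain Ruby. This is only an illustration of that upcoming step, not code from this MR, and the output path is made up:

require 'objspace'
require 'zlib'

# Dump all live heap objects as JSON lines, then gzip the result.
path = "/tmp/heap_dump.#{Process.pid}.json"

File.open(path, 'w') do |file|
  ObjectSpace.dump_all(output: file)
end

Zlib::GzipWriter.open("#{path}.gz") do |gz|
  gz.write(File.binread(path))
end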

Since this is just wiring and calling stubs, it should be safe to deploy without a feature flag. I intend to introduce a new ops toggle as part of the final step, which will put the actual meat on this walking skeleton.

The overall approach is:

  1. memory-watchdog decides to terminate a worker due to a memory violation.
  2. If heap dumps are enabled, it "enqueues" one by setting a flag. We don't dump the heap right then and there because the worker might still be busy serving requests.
  3. memory-watchdog signals TERM so the worker can start its shutdown procedure.
  4. Puma emits a worker_shutdown event; we receive it and call into the HeapDump report module (see the sketch below).
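
A rough Ruby sketch of that flow, assuming the on_worker_stop hook added in !103372 is reachable via Gitlab::Cluster::LifecycleEvents; the HeapDump method names and handler code below are illustrative, not this MR's exact API:

# Stub report module: "enqueue" only flags the dump, "write" runs at shutdown.
module HeapDump
  @enqueued = false

  def self.enqueue!
    @enqueued = true
    Gitlab::AppLogger.info(message: 'enqueue', perf_report: 'heap_dump')
  end

  def self.write_conditionally
    return unless @enqueued

    # Currently this only logs; the actual ObjectSpace dump lands in the next MR.
    Gitlab::AppLogger.info(message: 'write', perf_report: 'heap_dump')
  end
end

# Steps 2 and 3, inside the watchdog's violation handler:
HeapDump.enqueue! if ENV['GITLAB_MEMWD_DUMP_HEAP'] == '1'
Process.kill(:TERM, Process.pid)

# Step 4, during Puma's worker_shutdown event:
Gitlab::Cluster::LifecycleEvents.on_worker_stop do
  HeapDump.write_conditionally
end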


How to set up and validate locally

This is a bit tricky to set up locally (a condensed sketch follows the steps):

  1. Set the GITLAB_MEMORY_WATCHDOG_ENABLED=1 env var
  2. Set the GITLAB_DIAGNOSTIC_REPORTS_ENABLED=1 env var
  3. Set the GITLAB_MEMWD_DUMP_HEAP=1 env var
  4. Enable the feature flag: Feature.enable(:gitlab_memory_watchdog)
  5. Enable the feature flag: Feature.enable(:enforce_memory_watchdog)
  6. Wait for or trigger a high memory use event (this can be "forced" by setting, for example, GITLAB_MEMWD_MAX_HEAP_FRAG to a very low value or zero)
  7. Tail log/application_json.log
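
Condensed, steps 1–6 amount to something like this; how you inject the environment variables depends on your setup (e.g. GDK), and the values are only meant to force a violation quickly:

# Environment of the Puma/Sidekiq process under test:
#   GITLAB_MEMORY_WATCHDOG_ENABLED=1
#   GITLAB_DIAGNOSTIC_REPORTS_ENABLED=1
#   GITLAB_MEMWD_DUMP_HEAP=1
#   GITLAB_MEMWD_MAX_HEAP_FRAG=0   # force a heap-fragmentation violation
#
# In a Rails console:
Feature.enable(:gitlab_memory_watchdog)
Feature.enable(:enforce_memory_watchdog)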

You should see something like:

{"severity":"WARN","time":"2022-11-16T08:12:10.173Z","correlation_id":null,"pid":201,"worker_id":"puma_0","memwd_handler_class":"Gitlab::Memory::Watchdog::PumaHandler","memwd_sleep_time_s":5,"memwd_rss_bytes":621776896,"memwd_max_strikes":3,"memwd_cur_strikes":4,"message":"heap fragmentation limit exceeded","memwd_cur_heap_frag":0.05522292052969269,"memwd_max_heap_frag":0.02}
{"severity":"INFO","time":"2022-11-16T08:12:10.174Z","correlation_id":null,"message":"enqueue","pid":201,"worker_id":"puma_0","perf_report":"heap_dump"}
{"severity":"INFO","time":"2022-11-16T08:12:10.174Z","correlation_id":null,"message":"write","pid":201,"worker_id":"puma_0","perf_report":"heap_dump"}
{"severity":"INFO","time":"2022-11-16T08:12:10.178Z","correlation_id":null,"pid":201,"worker_id":"puma_0","memwd_handler_class":"Gitlab::Memory::Watchdog::PumaHandler","memwd_sleep_time_s":5,"memwd_rss_bytes":618180608,"message":"stopped"}


Related to #370077 (closed)

