Add plumbing for dumping heap on worker shutdown
## What does this MR do and why?
This is part of #370077 (closed)
We are adding functionality to dump Ruby's object space when killing Puma or Sidekiq workers due to high memory use. These heap dumps will then be picked up by our diagnostics report uploader and stored in GCS. We are splitting this work into three increments:

- Add an `on_worker_stop` life-cycle hook: !103372 (merged)
- This MR: Add plumbing that connects the memory-watchdog violation handlers with the shutdown hook and calls into an empty `HeapDump` module. It currently does nothing except emit a log event, so that we can see all the wires connect properly when deploying this.
- Next: implement the `HeapDump` module to write a compressed `ObjectSpace` dump.

Since this is just wiring and calling into stubs, it should be safe to deploy without a feature flag. I intend to introduce a new `ops` toggle as part of the final step, which will put the actual meat on this walking skeleton.
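For illustration, the empty `HeapDump` module could look roughly like this minimal Ruby sketch (names and logger plumbing are hypothetical, not the actual GitLab code): it does nothing except emit a structured log event, so the wiring is observable end to end.

```ruby
require 'json'

# Hypothetical stub of the HeapDump module: emits a structured log
# event and nothing else, so we can verify the plumbing in production
# logs before the real ObjectSpace dump is implemented.
module HeapDump
  class << self
    # Injectable log sink; defaults to stdout (illustrative only).
    attr_writer :logger

    def logger
      @logger ||= $stdout
    end

    # Called from the worker shutdown hook; the real dump comes later.
    def write
      logger.puts(JSON.generate(
        severity: 'INFO',
        message: 'write',
        pid: Process.pid,
        perf_report: 'heap_dump'
      ))
    end
  end
end
```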
The overall approach is:

- `memory-watchdog` decides to terminate a worker due to a memory violation.
- If heap dumps are enabled, it "enqueues" one by setting a flag. We don't dump the heap here and now because the worker might still be busy serving requests.
- `memory-watchdog` signals `TERM` so the worker can start its shutdown procedure.
- Puma emits a `worker_shutdown` event; we receive it and call into the `HeapDump` report module.
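The enqueue-then-flush flow above can be sketched as follows (a hypothetical illustration, not the real handler code): the watchdog handler only sets a flag, and the actual work happens later from the shutdown hook.

```ruby
# Illustrative sketch of deferring the heap dump: the memory-watchdog
# handler marks a dump as pending, and the worker_shutdown hook flushes
# it once the worker is no longer serving requests.
module HeapDumpTrigger
  @enqueued = false

  class << self
    attr_reader :enqueued

    # Called by the watchdog violation handler: only set a flag,
    # since the worker might still be busy.
    def enqueue
      @enqueued = true
    end

    # Called from the worker_shutdown life-cycle hook.
    # Returns true if a dump was pending and handled.
    def flush
      return false unless @enqueued

      @enqueued = false
      true # here the real implementation would write the ObjectSpace dump
    end
  end
end
```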
## Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
## How to set up and validate locally
This is a bit tricky to set up locally:

- Set the `GITLAB_MEMORY_WATCHDOG_ENABLED: '1'` env var.
- Set the `GITLAB_DIAGNOSTIC_REPORTS_ENABLED: '1'` env var.
- Set the `GITLAB_MEMWD_DUMP_HEAP: '1'` env var.
- Enable `Feature.enable(:gitlab_memory_watchdog)`.
- Enable `Feature.enable(:enforce_memory_watchdog)`.
- Wait for or trigger a high memory use event (this can be "forced" by setting e.g. `GITLAB_MEMWD_MAX_HEAP_FRAG` to a very low value, or zero).
- Tail `log/application_json.log`.
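For convenience, the environment variables above can be exported in one go before starting the local Puma process (a hypothetical shell snippet; adjust the values and your start command to your local setup, e.g. GDK):

```shell
# Enable the watchdog, diagnostic reports, and heap dumping locally.
export GITLAB_MEMORY_WATCHDOG_ENABLED='1'
export GITLAB_DIAGNOSTIC_REPORTS_ENABLED='1'
export GITLAB_MEMWD_DUMP_HEAP='1'
# Force a violation quickly by making the fragmentation limit tiny
# (the 0.01 value is just an example).
export GITLAB_MEMWD_MAX_HEAP_FRAG='0.01'
```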
You should see something like:

```json
{"severity":"WARN","time":"2022-11-16T08:12:10.173Z","correlation_id":null,"pid":201,"worker_id":"puma_0","memwd_handler_class":"Gitlab::Memory::Watchdog::PumaHandler","memwd_sleep_time_s":5,"memwd_rss_bytes":621776896,"memwd_max_strikes":3,"memwd_cur_strikes":4,"message":"heap fragmentation limit exceeded","memwd_cur_heap_frag":0.05522292052969269,"memwd_max_heap_frag":0.02}
{"severity":"INFO","time":"2022-11-16T08:12:10.174Z","correlation_id":null,"message":"enqueue","pid":201,"worker_id":"puma_0","perf_report":"heap_dump"}
{"severity":"INFO","time":"2022-11-16T08:12:10.174Z","correlation_id":null,"message":"write","pid":201,"worker_id":"puma_0","perf_report":"heap_dump"}
{"severity":"INFO","time":"2022-11-16T08:12:10.178Z","correlation_id":null,"pid":201,"worker_id":"puma_0","memwd_handler_class":"Gitlab::Memory::Watchdog::PumaHandler","memwd_sleep_time_s":5,"memwd_rss_bytes":618180608,"message":"stopped"}
```
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #370077 (closed)