
Add watchdog to observe memory fragmentation

Matthias Käppler requested to merge 365950-memory-watchdog into master

What does this MR do and why?

We currently run two in-application memory killers at GitLab (with varying success):

  • Puma worker killer. This is a 3rd-party gem we use to reap Puma workers should they overrun a certain total RSS budget (the primary included). We do not actually run this in production currently.
  • Sidekiq memory killer. This is home-grown but works similarly to PWK.

We run these for two reasons:

  • In unmanaged deployments (from-source, Omnibus), Rails/Ruby can run up to several GB of memory over time, and the last resort is for Linux to OOM-kill these processes should the system run out of memory.
  • In managed deployments (Kubernetes), we already have container and pod schedulers that enforce resource limits, but they simply pull the rug out when they see a memory limit breached, so we would like to avoid that.

So we like to have a mechanism in place that tries to pre-empt any system-level OOM killers since these do not result in soft landings but rather KILL the targeted resource. Staying on top of this in the application itself allows us to anticipate low memory situations and more gracefully shut down a process (for instance, TERMing a Sidekiq worker allows it to finish its current job, and for Puma workers a grace period for orderly shutdown is used.)

So much for "the good parts". Now for the bad parts.

I summarized in #365950 (closed) why I think these two solutions are not ideal, but to TLDR it:

  • Using RSS is problematic, because:
    • It does not account for shared memory, but in pre-fork servers like Puma, processes can share hundreds of MB of memory.
    • cgroup memory is observed in Kubernetes using working_set_bytes, not RSS. This means the two approaches use an apples-to-oranges comparison to make decisions about memory use.

We furthermore found in &8105 that high memory use is largely a function of low heap utilization, or conversely, high heap fragmentation. We added a metric in #365252 (closed) to measure this precisely.
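
For context, a fragmentation ratio of this kind can be derived from Ruby's GC.stat counters. The sketch below only illustrates the general idea and is not necessarily the exact code behind the metric from #365252:

  # Share of unused slots across the eden heap pages currently held in memory:
  # 0.0 means pages are fully packed with live objects, values close to 1.0 mean
  # we retain many pages that are mostly empty.
  SLOTS_PER_PAGE = GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]

  def gc_heap_fragmentation(gc_stat = GC.stat)
    1.0 - (gc_stat[:heap_live_slots].to_f / (gc_stat[:heap_eden_pages] * SLOTS_PER_PAGE))
  end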

This MR proposes a new "memory killer" implementation (I called it "memory watchdog" because dogs are much cuter than killers 🐕) that does not observe memory based on process RSS, but on how fragmented the Ruby heap is. I describe the approach below.

Implementation

This implementation is a Daemon thread that runs in the background perpetually. It will wake up every N seconds and poll gc_heap_fragmentation, which expresses as a ratio between 0 and 1 how poorly utilized the Ruby heap pages currently held in memory are.

Whenever it wakes up and finds that fragmentation is above a given threshold, it issues a strike. If fragmentation is below the threshold, it resets existing strikes. If the process has accumulated too many strikes, it sends out a callback to a Handler object. It is up to that handler to decide what to do next.
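
To make the strike mechanism concrete, here is a simplified sketch of the polling loop. Names, defaults and the callback signature are illustrative, not necessarily what the actual Gitlab::Memory::Watchdog code looks like:

  class Watchdog # simplified sketch
    def initialize(handler:, max_heap_frag:, max_strikes:, sleep_time_s:)
      @handler = handler
      @max_heap_frag = max_heap_frag
      @max_strikes = max_strikes
      @sleep_time_s = sleep_time_s
      @strikes = 0
    end

    def run_thread
      loop do
        sleep(@sleep_time_s)

        frag = gc_heap_fragmentation # the metric sketched above
        if frag > @max_heap_frag
          @strikes += 1
          # Too many consecutive breaches: defer the decision to the handler.
          @handler.on_high_heap_fragmentation(frag) if @strikes > @max_strikes
        else
          @strikes = 0
        end
      end
    end
  end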

There are three handlers implemented:

  • NullHandler - does nothing.
  • PumaHandler - performs an orderly shutdown of a Puma worker via a WorkerHandle.
  • TermProcessHandler - sends SIGTERM, which we use for Sidekiq.

In this first iteration, I force "rehearsal mode" and only use the NullHandler, which does nothing. Since we also log these events, this will allow us to deploy this MR and observe in production how often the watchdog "barks". We can use this to tune its sensitivity before actually stopping processes.
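
For illustration, the handlers boil down to something like the following; the actual classes live under Gitlab::Memory::Watchdog, and the method name and return values here are assumptions for the sketch, not the exact interface:

  class NullHandler
    def on_high_heap_fragmentation(_value)
      # "Rehearsal mode": the event is only logged, the process keeps running.
      false
    end
  end

  class TermProcessHandler
    def initialize(pid = $$)
      @pid = pid
    end

    def on_high_heap_fragmentation(_value)
      # Graceful stop for e.g. a Sidekiq worker: SIGTERM lets it finish its current job.
      Process.kill(:TERM, @pid)
      true
    end
  end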

Risk considerations

I've taken several steps to de-risk the deployment of this:

  • The watchdog will not start unless enabled via GITLAB_MEMORY_WATCHDOG_ENABLED
  • The watchdog currently does not reap workers and uses the NullHandler instead. This makes sure we do not start reaping worker processes in production without first fine-tuning the given limits.
  • When enabled, and regardless of the handler used, there is an ops feature toggle that is checked every time the WD wakes up; the iteration is skipped if the flag is disabled (see the sketch after this list). This allows us to turn the WD on and off in production, either because it over-reacts given the set limits or because we found a bug.
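
A sketch of this two-level kill switch; the flag name comes from the setup instructions below, while the method names and the simplified env var check are assumptions:

  def start?
    # Hard gate, evaluated once at boot: without the env var the watchdog thread is never started.
    ENV['GITLAB_MEMORY_WATCHDOG_ENABLED'] == '1'
  end

  def enabled?
    # Soft gate, evaluated on every wake-up: flipping this ops flag pauses the
    # watchdog in production without a restart or redeploy.
    Feature.enabled?(:gitlab_memory_watchdog)
  end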

Screenshots or screen recordings

Not much to show here, but this is what the logs look like:

{"severity":"WARN","time":"2022-07-13T07:23:26.578Z","correlation_id":null,"pid":193,"worker_id":"puma_1","memwd_handler_class":"Gitlab::Memory::Watchdog::NullHandler","memwd_sleep_time_s":3,"memwd_max_heap_frag":0.1,"memwd_max_strikes":1,"memwd_cur_strikes":4,"message":"heap fragmentation limit exceeded","memwd_cur_heap_frag":0.2570456746683667}
{"severity":"WARN","time":"2022-07-13T07:23:35.594Z","correlation_id":null,"pid":193,"worker_id":"puma_1","memwd_handler_class":"Gitlab::Memory::Watchdog::NullHandler","memwd_sleep_time_s":3,"memwd_max_heap_frag":0.1,"memwd_max_strikes":1,"memwd_cur_strikes":2,"message":"heap fragmentation limit exceeded","memwd_cur_heap_frag":0.11796643560956588}
...
{"severity":"WARN","time":"2022-07-13T07:23:41.480Z","correlation_id":null,"pid":191,"worker_id":"puma_0","memwd_handler_class":"Gitlab::Memory::Watchdog::NullHandler","memwd_sleep_time_s":3,"memwd_max_heap_frag":0.1,"memwd_max_strikes":1,"memwd_cur_strikes":2,"message":"heap fragmentation limit exceeded","memwd_cur_heap_frag":0.20008158015186062}
...

How to set up and validate locally

  • Set the environment variable GITLAB_MEMORY_WATCHDOG_ENABLED=1
  • Enable the feature flag: Feature.enable(:gitlab_memory_watchdog)
  • The default configuration is fairly generous, so you will most likely never see it kick in. You can use the following env vars to make the watchdog more sensitive to heap fragmentation and to wake up more frequently (a sketch of how they might map to settings follows this list):
    • GITLAB_MEMWD_MAX_HEAP_FRAG (value of 0 to 1)
    • GITLAB_MEMWD_MAX_STRIKES (int)
    • GITLAB_MEMWD_SLEEP_TIME_SEC (int)
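
A sketch of how these knobs might map to the watchdog's settings; the fallback values here are placeholders, not the actual defaults:

  max_heap_frag = ENV.fetch('GITLAB_MEMWD_MAX_HEAP_FRAG', 0.5).to_f  # 0.0..1.0
  max_strikes   = ENV.fetch('GITLAB_MEMWD_MAX_STRIKES', 5).to_i      # consecutive breaches tolerated
  sleep_time_s  = ENV.fetch('GITLAB_MEMWD_SLEEP_TIME_SEC', 60).to_i  # seconds between polls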

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #365950 (closed)
