Sidekiq: Fork into metrics-server instead of `exec`ing

Matthias Käppler requested to merge 347199-fork-into-metrics-server into master

What does this MR do and why?

We recently broke out the in-process metrics server for Sidekiq into its own process. Eventually we are looking to reuse this server in Puma as well.

In the first iteration of metrics-server (!74875 (merged), !75247 (merged)), we used Process.spawn to launch the server from the parent process. That is simple, but can be inefficient: it translates into an OS-level clone and `exec` that creates a new memory map, so almost no memory is shared between parent and child.

We found in #347199 (closed) that with a plain fork, even after some memory regions have been mutated (e.g. by serving several metrics requests), unique pages drop by an order of magnitude, and ~40-50 MB out of 50-60 MB end up being shared.

Therefore, we now fork from sidekiq-cluster instead of using spawn. I kept the bin/metrics-server script for now, since it is quite useful for testing the server, both in automated end-to-end tests and manually, without having to launch a Sidekiq cluster.
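
The difference between the two approaches can be illustrated with a minimal Ruby sketch. This is not the code from this MR: `PRELOADED`, `spawn_child`, `fork_child`, and `exit_code` are hypothetical names, and the array merely stands in for the application code that the parent has already loaded.

```ruby
# Sketch: Process.spawn (clone + exec of a fresh interpreter) vs. Kernel#fork.
# All names below are hypothetical and only serve the illustration.
require "rbconfig"

# Stands in for code and data the parent loaded before creating the child.
PRELOADED = Array.new(100_000) { |i| "item-#{i}" }

# spawn execs a brand-new Ruby process: nothing loaded by the parent is
# available, so the child would have to load everything again.
def spawn_child
  Process.spawn(RbConfig.ruby, "-e", "exit(defined?(PRELOADED) ? 0 : 1)")
end

# fork copies the parent's address space copy-on-write: the child sees
# PRELOADED immediately, and untouched pages stay shared with the parent.
def fork_child
  fork do
    # exit! skips at_exit handlers inherited from the parent.
    exit!(PRELOADED.size == 100_000 ? 0 : 1)
  end
end

# Wait for the child and return its exit code (0 means PRELOADED was visible).
def exit_code(pid)
  _, status = Process.wait2(pid)
  status.exitstatus
end

spawned = exit_code(spawn_child)
forked = exit_code(fork_child)
puts "spawned child sees PRELOADED: #{spawned.zero?}"
puts "forked child sees PRELOADED:  #{forked.zero?}"
```

Pages stay shared after a fork only until one of the processes writes to them, which is why the measurements below were taken after the server had handled some requests.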

I also cleaned up a few unrelated issues both in tests and code that went unnoticed in the initial implementation of this. I left comments accordingly.

Memory savings

Letting sidekiq-cluster and the server run for a while and sending multiple requests to /metrics, we see the following results:

git@6e8215dd1600:~/gitlab$ smem -P 'sidekiq-cluster'
  PID User     Command                         Swap      USS      PSS      RSS 
  199 git      /usr/bin/python /usr/bin/sm        0     8632     9022    12228 
   69 git      ruby /home/git/gitlab/bin/s        0    11028    28128    50912 
   71 git      ruby /home/git/gitlab/bin/s        0    25792    42401    63152

PID 71 is the metrics server, forked from PID 69.

I looked at the memory maps of 71 as well, and shared pages sum up to about 40MB:

71:   ruby /home/git/gitlab/bin/sidekiq-cluster * -P /home/git/gitlab/tmp/pids/sidekiq-cluster.pid -e development -e development
... Size KernelPageSize MMUPageSize   Rss   Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous ...
...
   ====== ============== =========== ===== ===== ============ ============ ============= ============= ========== ========= 
   270960           1244        1244 61864 41122         6124        31184             0         24556      36720     55740
Shared_Clean + Shared_Dirty = 6124 kB + 31184 kB ≈ 37 MB

It is unclear which process has written to the shared pages, and to what extent that will keep happening over the lifetime of those processes, so the shared portion will likely shrink over time.
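
To track how the shared portion develops, the Shared_Clean + Shared_Dirty sum can be recomputed periodically. The following is a rough sketch that assumes the per-field layout of Linux's /proc/<pid>/smaps; the MR does not say which tool produced the table above, and `shared_kb` and `shared_kb_for_pid` are hypothetical helper names. The sample text reuses the totals from the table.

```ruby
# Sum the Shared_Clean and Shared_Dirty fields (in kB) of smaps-formatted text.
def shared_kb(smaps_text)
  smaps_text.each_line.sum do |line|
    match = line.match(/\A(Shared_Clean|Shared_Dirty):\s+(\d+) kB/)
    match ? match[2].to_i : 0
  end
end

# Assumption: Linux only, and the caller may read the target's /proc entry.
def shared_kb_for_pid(pid)
  shared_kb(File.read("/proc/#{pid}/smaps"))
end

# Totals for PID 71 from the table above, rewritten in smaps field layout.
sample = <<~SMAPS
  Rss:               61864 kB
  Pss:               41122 kB
  Shared_Clean:       6124 kB
  Shared_Dirty:      31184 kB
  Private_Clean:         0 kB
  Private_Dirty:     24556 kB
SMAPS

puts "shared: #{shared_kb(sample)} kB"
```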

Before

Comparing this to memory use on master:

git@00817cb3936d:~/gitlab$ smem -P 'sidekiq-cluster|metrics-server'
  PID User     Command                         Swap      USS      PSS      RSS 
  166 git      /usr/bin/python /usr/bin/sm        0     8652     8946    12352 
   69 git      ruby /home/git/gitlab/bin/s        0    42340    43434    51028 
   71 git      ruby /home/git/gitlab/bin/m        0    68616    69738    77500

Almost all pages in both processes are unique to the process, i.e. private anonymous memory (USS = unique set size). Those pages are not shared and hence bloat physical memory use by that amount.
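
For a rough single-number comparison, the smem figures from both listings can be summed per column. PSS (proportional set size) splits each shared page evenly among the processes sharing it, so the PSS sum approximates the combined physical footprint. The sketch below only does arithmetic on the numbers already shown; the constant and method names are hypothetical.

```ruby
# smem figures in kB, copied from the two listings in this MR.
BEFORE = {
  "sidekiq-cluster" => { uss: 42_340, pss: 43_434, rss: 51_028 },
  "metrics-server"  => { uss: 68_616, pss: 69_738, rss: 77_500 }
}.freeze

AFTER = {
  "sidekiq-cluster"         => { uss: 11_028, pss: 28_128, rss: 50_912 },
  "metrics-server (forked)" => { uss: 25_792, pss: 42_401, rss: 63_152 }
}.freeze

# Sum one smem column across all processes in a listing.
def total(listing, column)
  listing.values.sum { |row| row[column] }
end

pss_before = total(BEFORE, :pss)
pss_after = total(AFTER, :pss)
puts "PSS before: #{pss_before} kB"
puts "PSS after:  #{pss_after} kB"
puts "PSS saved:  #{pss_before - pss_after} kB"
puts "USS before: #{total(BEFORE, :uss)} kB, after: #{total(AFTER, :uss)} kB"
```

RSS is left out of the comparison because it counts shared pages once per process, which overstates the combined footprint.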

How to set up and validate locally

To run the server manually:

  1. run: METRICS_SERVER_TARGET=sidekiq bin/metrics-server
  2. verify: curl localhost:3807/metrics

The response might be empty, unless there are metrics db files left over from previous sidekiq runs.
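
The verification step can also be scripted. This is a minimal sketch using Ruby's standard Net::HTTP; `fetch_metrics` is a hypothetical helper, and the default port is simply the one from the curl command above.

```ruby
require "net/http"
require "uri"

# Fetch /metrics from a locally running metrics server and return the
# Net::HTTPResponse. Raises (e.g. Errno::ECONNREFUSED) if nothing is listening.
def fetch_metrics(host: "localhost", port: 3807)
  Net::HTTP.get_response(URI("http://#{host}:#{port}/metrics"))
end
```

With the server from step 1 running, `fetch_metrics.code` should be "200", and `fetch_metrics.body` may be empty for the reason given above.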

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #347199 (closed)

Edited by Matthias Käppler
