Allow spawning metrics_server instead of forking into it
What does this MR do and why?
This is a follow-up to !78527 (merged).
Sidekiq currently forks into the metrics_server module on SaaS to serve metrics to Prometheus. This was done to leverage memory page sharing, and the parent process we fork from was already lightweight (it is a Ruby script called sidekiq-cluster).
For Puma, we currently run an in-process metrics server, running in the Puma primary. In &7304 (closed) we are looking to extract this into a separate server process, as we did for Sidekiq.
However, I found that forking into the server from the Puma primary is not desirable, for two reasons:
- I found it to be less memory efficient compared to Sidekiq (see explanation below).
- In the long run, we are looking to replace our Ruby exporters with a new application exporter written in Go. This means we will need to spawn a new process anyway, since forking will not be an option anymore.
This MR therefore adds a new function, MetricsServer.spawn, which executes the bin/metrics-server command instead of forking from the caller. Unfortunately, the forking variant was previously called spawn, so I renamed the old function to fork and gave the name spawn to the new, non-forking variant. Note that the new function is not in active use yet outside of integration tests. This merely paves the way to eventually spawn the server from the Puma primary in a follow-up.
Memory use
To see whether it is more efficient to fork or spawn, I looked at memory maps for both Puma and Sidekiq. What we want to focus on is the sum of unique pages (USS) across all processes in a process cluster, since this is unshared memory that adds to real memory use. (RSS is very misleading in pre-fork systems, since much of the memory is shared between processes.)
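As a rough illustration of what these figures correspond to on Linux, the sketch below derives USS, PSS, and RSS from the text of `/proc/<pid>/smaps_rollup`, treating USS as the sum of the `Private_Clean` and `Private_Dirty` fields. The helper name `memory_summary` is hypothetical, and this is not necessarily how `smem` computes its columns.

```ruby
# Hypothetical helper: summarizes the contents of /proc/<pid>/smaps_rollup (Linux).
# The kernel reports these values in kB.
def memory_summary(smaps_rollup_text)
  # Collect every "Name:   123 kB" line into a Hash of name => integer.
  fields = smaps_rollup_text.scan(/^(\w+):\s+(\d+) kB$/).to_h { |name, kb| [name, kb.to_i] }
  {
    # Pages mapped only by this process; this is what adds to real memory use.
    uss_kb: fields.fetch('Private_Clean', 0) + fields.fetch('Private_Dirty', 0),
    # Shared pages are divided among the processes that map them.
    pss_kb: fields.fetch('Pss', 0),
    # All resident pages, counting shared pages in full for every process.
    rss_kb: fields.fetch('Rss', 0)
  }
end
```

On a Linux host this could be called as `memory_summary(File.read("/proc/#{Process.pid}/smaps_rollup"))`, assuming the kernel exposes that file.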
Puma
We can see that puma_exporter, when forked from the primary (pid 7), accounts for 114MB of unshared memory. The rest is shared roughly proportionally with the primary (PSS). When spawned into a new process with its own memory map, puma_exporter consumes merely 14MB of unshared memory, an order of magnitude less than when forking.
This can be explained by memory pages being dirtied by one of these processes post-fork, which triggers copy-on-write, expanding the overall memory used.
//FORK:
git@ced553165878:~/gitlab$ smem -P puma
PID User Command                     Swap    USS    PSS    RSS
 75 git  puma_exporter                  0 114360 227357 494232
  7 git  puma 5.5.2 (tcp://0.0.0.0:8    0 122920 240603 515152
 80 git  puma: cluster worker 1: 7 [    0 349256 431325 669252
 77 git  puma: cluster worker 0: 7 [    0 490452 570804 803872
//SPAWN:
git@119d7d022ed2:~/gitlab$ smem -P puma
PID User Command                     Swap    USS    PSS    RSS
107 git  puma_exporter                  0  14308  35562  60996
  7 git  puma 5.5.2 (tcp://0.0.0.0:8    0 228064 343865 569912
 79 git  puma: cluster worker 1: 7 [    0 420212 529280 748024
 77 git  puma: cluster worker 0: 7 [    0 421492 536181 760512
Sidekiq
For comparison, I wanted to show that for Sidekiq, we get a very different picture. This is because Sidekiq does not use a pre-fork setup (there is no "primary Sidekiq"). It also uses a parent process wrapper from which workers are spawned, which itself is just a lightweight Ruby script (pid 70 in the process listing below).
We can see here that the forking model is still better for Sidekiq, since it results in >50% fewer unique pages for the wrapper and exporter combined (37MB vs 82MB of USS) while RSS remains roughly the same, meaning more memory is shared:
//FORK:
git@0b0a35a4d32d:~/gitlab$ smem -P sidekiq
PID User Command                     Swap    USS    PSS    RSS
 70 git  ruby /home/git/gitlab/bin/s    0  14384  39763  69220
 72 git  sidekiq_exporter               0  22444  46188  73168
 74 git  sidekiq 6.4.0 queues:author    0 555560 562334 575196
//SPAWN:
git@0ec7e1b9cc8c:~/gitlab$ smem -P sidekiq
PID User Command                     Swap    USS    PSS    RSS
 70 git  ruby /home/git/gitlab/bin/s    0  58348  61123  69124
 76 git  sidekiq_exporter               0  23844  44819  69660
 74 git  sidekiq 6.4.0 queues:author    0 558948 565345 578888
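The comparison can be reproduced from the USS column of the two Sidekiq listings above. The sketch below hard-codes those values (in kB, as printed by `smem`) and sums them for the wrapper and exporter processes. The constant and method names are made up for illustration.

```ruby
# USS values in kB, copied from the smem listings above.
SIDEKIQ_USS_KB = {
  fork:  { wrapper: 14_384, exporter: 22_444 },
  spawn: { wrapper: 58_348, exporter: 23_844 }
}.freeze

# Sum of unshared memory for the wrapper and exporter in the given mode.
def unshared_total_kb(mode)
  SIDEKIQ_USS_KB.fetch(mode).values.sum
end

fork_kb  = unshared_total_kb(:fork)   # 36_828
spawn_kb = unshared_total_kb(:spawn)  # 82_192
savings  = (1 - fork_kb.fdiv(spawn_kb)) * 100

puts format('fork: %d kB, spawn: %d kB, fork uses %.0f%% less', fork_kb, spawn_kb, savings)
```

The Sidekiq worker itself (pid 74) is left out of the sum because its USS is nearly identical in both runs.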
How to set up and validate locally
There are no material changes in behavior outside of some renaming of arguments, so no need to validate this manually.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #350548 (closed)