Use external metrics server process for Puma
What does this MR do and why?
This is the main MR for #350548 (closed). It enables a code path for Puma in which we spawn a separate server process to serve metrics to Prometheus. Until now, this server ran in a thread in the Puma primary.
This change is guarded by an environment variable, PUMA_EXTERNAL_METRICS_SERVER. It is safe to deploy.
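As an illustration of what such a guard can look like, here is a minimal sketch. The helper name and the list of accepted truthy strings are assumptions made for this example; the MR does not state which values the real check accepts.

```ruby
# Hypothetical helper, not the code from this MR.
# ASSUMPTION: these strings count as "truthy"; the real list may differ.
TRUTHY_VALUES = %w[1 true t yes y on].freeze

# Returns true when the feature is switched on in the given environment.
# `env` defaults to the process environment but accepts any Hash-like object,
# which keeps the check easy to exercise in isolation.
def external_metrics_server_enabled?(env = ENV)
  value = env["PUMA_EXTERNAL_METRICS_SERVER"].to_s.strip.downcase
  TRUTHY_VALUES.include?(value)
end
```

With the variable unset, the helper returns false and the old in-thread code path stays in effect.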
I tried to keep changes to the already complicated 7_prometheus initializer minimal and moved the logic elsewhere.
This builds on top of ProcessSupervisor, which is already used by Sidekiq to also handle its metrics server. However, I had to make it a Daemon, since supervise was a blocking call (that worked for sidekiq-cluster because it already ran its own control thread). So there are some changes to sidekiq-cluster in here.
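To illustrate the difference between a blocking supervise call and a daemon-style supervisor, here is a simplified sketch. SketchSupervisor, its method names, and the injectable alive_check are all hypothetical and are not the ProcessSupervisor or Daemon code from this MR; the 5-second default interval is also an assumption.

```ruby
# Hypothetical sketch, NOT GitLab's ProcessSupervisor.
class SketchSupervisor
  # alive_check is injectable so the loop can be exercised without real processes.
  def initialize(check_interval: 5, alive_check: nil)
    @check_interval = check_interval
    @alive_check = alive_check || method(:process_alive?)
    @running = false
    @thread = nil
  end

  # Blocking variant: does not return until #stop is called from another thread.
  def supervise(pids, &on_death)
    @running = true
    run_loop(pids, &on_death)
  end

  # Daemon variant: runs the same loop in a background thread and returns at once.
  def start(pids, &on_death)
    @running = true
    @thread = Thread.new { run_loop(pids, &on_death) }
    self
  end

  def stop
    @running = false
    @thread&.join
  end

  private

  def run_loop(pids)
    while @running
      sleep(@check_interval)
      break unless @running

      dead = pids.reject { |pid| @alive_check.call(pid) }
      next if dead.empty?

      # The block restarts the dead processes and returns the replacement pids.
      pids = (pids - dead) + Array(yield(dead))
    end
  end

  def process_alive?(pid)
    Process.kill(0, pid) # signal 0 only checks that the process exists
    true
  rescue Errno::ESRCH
    false
  rescue Errno::EPERM
    true # the process exists but belongs to another user
  end
end
```

A caller that has no control thread of its own, such as the Puma primary, would use start; a caller that already runs one could keep calling supervise directly.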
Screenshots or screen recordings
The change should be completely transparent to users, but hey, here's a process listing!
git@ebcb4a3a1612:~/gitlab$ ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /usr/local/bin/dumb-init -- /scripts/entrypoint/gitlab-rails-env.sh /scripts/startup/web.sh
7 pts/0 Ssl+ 0:16 puma 5.6.2 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
78 pts/0 Sl 0:01 ruby /home/git/gitlab/bin/metrics-server
81 pts/0 Sl+ 0:05 puma: cluster worker 0: 7 [gitlab-puma-worker]
83 pts/0 Sl+ 0:05 puma: cluster worker 1: 7 [gitlab-puma-worker]
Additional memory use
This will naturally increase memory use on a Puma pod, but I found it to be small: !80191 (merged) (15MB of unique pages, although that will certainly be higher in production where we often serve several MB of metric data with a single scrape)
This is a conscious trade-off, since we are using memory as "currency" here to buy us improved fault isolation. We are looking to make the exporter more efficient by rewriting it in Go.
How to set up and validate locally (dev env)
- Make sure monitoring.web_exporter.enabled is true in gitlab.yml
- Set PUMA_EXTERNAL_METRICS_SERVER to something truthy
- Start Rails
- In log/application_json.log, you should see the line: {"severity":"INFO","time":"2022-03-09T13:59:12.887Z","correlation_id":null,"message":"Starting Puma metrics server with pid 78"}
- Metrics should be exported from the /metrics endpoint (NOT /-/metrics, which is a Rails controller). See config for which port it listens on.
The log line implies the server process is running with the given process ID. You can do some more testing, such as:
- TERM the metrics server, e.g. kill <pid>. This should cause the process supervisor to restart it within 5 seconds or so (see log file for a message).
- SIGUSR2 the Puma primary. This should stop the metrics server along with Puma, then restart it.
- For the giggles, kill a Puma worker. This should not affect the metrics server.
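The pid needed for the kill step can be read back from the log line instead of from ps. The following is a rough sketch; the function names are made up for this example, and it assumes the message format matches the sample log line exactly.

```ruby
require "json"

# ASSUMPTION: the message text matches the sample log line in this description.
PID_MESSAGE = /\AStarting Puma metrics server with pid (\d+)\z/

# Parses one JSON log line; returns nil for anything that is not a JSON object.
def parse_log_line(line)
  entry = JSON.parse(line)
  entry.is_a?(Hash) ? entry : nil
rescue JSON::ParserError
  nil
end

# Returns the pid from the most recent matching log entry, or nil if none exists.
# Scanning from the end matters because the supervisor logs a new pid on restart.
def latest_metrics_server_pid(lines)
  lines.reverse_each do |line|
    entry = parse_log_line(line.strip)
    next unless entry

    match = PID_MESSAGE.match(entry["message"].to_s)
    return Integer(match[1]) if match
  end
  nil
end

# Example use (not executed here):
#   pid = latest_metrics_server_pid(File.readlines("log/application_json.log"))
#   Process.kill("TERM", pid) if pid
```

Running the lookup again a few seconds after the TERM should return a different pid if the supervisor restarted the server.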
How to test in Omnibus
I also tested these changes with the image published for QA: fc0074b2ab355ea536a19180c5b7d0fa62a045f4
To enable the server, set the following in /etc/gitlab/gitlab.rb:
gitlab_rails['env'] = {
  "PUMA_EXTERNAL_METRICS_SERVER" => "1"
}
puma['exporter_enabled'] = true
puma['exporter_log_enabled'] = true
puma['exporter_address'] = "127.0.0.1"
puma['exporter_port'] = 8083
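With the settings above, one way to confirm the exporter is answering is to request /metrics on the configured address and port. This is a small sketch using Ruby's standard library; the helper names are hypothetical, and it assumes the exporter serves plain HTTP on that address.

```ruby
require "net/http"
require "uri"

# Builds the scrape URL from the exporter address and port.
def metrics_uri(address, port)
  URI::HTTP.build(host: address, port: port, path: "/metrics")
end

# Fetches the metrics payload and fails loudly on any non-2xx response.
def scrape_metrics(uri)
  response = Net::HTTP.get_response(uri)
  raise "unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Example use against the values from gitlab.rb (not executed here):
#   puts scrape_metrics(metrics_uri("127.0.0.1", 8083)).lines.first(5)
```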
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #350548 (closed)