
Use external metrics server process for Puma

Matthias Käppler requested to merge 350548-puma-metrics-server-process into master

What does this MR do and why?

This is the main MR for #350548 (closed). It enables a code path for Puma in which we spawn a separate server process to serve metrics to Prometheus. Until now, this server was running in a thread in the Puma primary.

This change is guarded by an environment variable: PUMA_EXTERNAL_METRICS_SERVER. It is safe to deploy.

I tried to keep changes to the already complicated 7_prometheus initializer minimal and moved the logic elsewhere.
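
For illustration, the guard boils down to something like this. This is a minimal sketch only; MetricsServer.start_for_puma is a placeholder name, not necessarily the actual entry point in the initializer:

if Gitlab::Utils.to_boolean(ENV['PUMA_EXTERNAL_METRICS_SERVER'])
  # New path: fork a separate metrics server process and supervise it.
  MetricsServer.start_for_puma
else
  # Existing path: run the exporter in a thread inside the Puma primary.
  Gitlab::Metrics::Exporter::WebExporter.instance.start
end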

This builds on top of ProcessSupervisor, which Sidekiq already uses to supervise its own metrics server. However, I had to turn it into a Daemon, since supervise was a blocking call (blocking worked for sidekiq-cluster because it already runs its own control thread). So there are some changes to sidekiq-cluster in here as well.
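
To make that concrete, here is a rough sketch of the Daemon shape; this is simplified and not the actual ProcessSupervisor code, just the idea: supervise now runs its health-check loop on a background thread instead of blocking the caller.

class ProcessSupervisor
  # Sketch only: the supervision loop runs on a background thread,
  # so callers like the Puma primary are not blocked.
  def supervise(pid, &restart)
    @thread = Thread.new do
      loop do
        sleep(5) # health-check interval
        next if process_alive?(pid)

        pid = restart.call # respawn the supervised process
      end
    end
  end

  private

  def process_alive?(pid)
    Process.kill(0, pid) # signal 0 only checks for existence
    true
  rescue Errno::ESRCH
    false
  end
end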

Screenshots or screen recordings

The change should be completely transparent to users, but hey, here's a process listing!

git@ebcb4a3a1612:~/gitlab$ ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/local/bin/dumb-init -- /scripts/entrypoint/gitlab-rails-env.sh /scripts/startup/web.sh
      7 pts/0    Ssl+   0:16 puma 5.6.2 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
     78 pts/0    Sl     0:01 ruby /home/git/gitlab/bin/metrics-server
     81 pts/0    Sl+    0:05 puma: cluster worker 0: 7 [gitlab-puma-worker]
     83 pts/0    Sl+    0:05 puma: cluster worker 1: 7 [gitlab-puma-worker]

🚀

Additional memory use

This will naturally increase memory use on a Puma pod, but I found the increase to be small: !80191 (merged) measured about 15MB of unique pages, although that will certainly be higher in production, where we often serve several MB of metric data with a single scrape.
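
If you want to reproduce a rough unique-pages (USS) figure yourself, summing the Private_* fields from smaps_rollup gets you there on Linux; <pid> is the metrics server PID from a listing like the one above:

awk '/^Private/ { sum += $2 } END { print sum " kB" }' /proc/<pid>/smaps_rollup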

This is a conscious trade-off, since we are using memory as "currency" here to buy us improved fault isolation. We are looking to make the exporter more efficient by rewriting it in Go.

How to set up and validate locally (dev env)

  1. Make sure monitoring.web_exporter.enabled is true in gitlab.yml (see the sketch after this list)
  2. Set the PUMA_EXTERNAL_METRICS_SERVER environment variable to something truthy
  3. Start Rails
  4. In log/application_json.log, you should see the line:
    {"severity":"INFO","time":"2022-03-09T13:59:12.887Z","correlation_id":null,"message":"Starting Puma metrics server with pid 78"}
  5. Metrics should be exported from the /metrics endpoint (NOT /-/metrics -- that's a Rails controller). See the config for which port it listens on.
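
For reference, the relevant gitlab.yml section could look like the following; address and port are example values, so use whatever your environment is configured with:

# gitlab.yml -- example values
monitoring:
  web_exporter:
    enabled: true
    address: 127.0.0.1
    port: 8083

With those values, step 5 can be verified with e.g. curl http://127.0.0.1:8083/metrics.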

The log line in step 4 implies the server process is running with the given process ID. You can do some more testing with this, such as the following (see the example commands after this list):

  1. TERM the metrics server, e.g. kill <pid>. This should cause the process supervisor to restart it within 5 seconds or so (see log file for a message).
  2. SIGUSR2 the Puma primary. This should stop the metrics server along with Puma, then restart it.
  3. For the giggles, kill a Puma worker. This should not affect the metrics server.
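
For convenience, here are the three tests above as shell commands, using the example PIDs from the process listing earlier (78 for the metrics server, 7 for the Puma primary, 81 for a worker):

kill -TERM 78   # metrics server; the supervisor should respawn it within ~5 seconds
kill -USR2 7    # Puma primary; the metrics server stops and restarts along with Puma
kill -KILL 81   # Puma worker; the metrics server should be unaffected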

How to test in Omnibus

I also tested these changes with the image published for QA: fc0074b2ab355ea536a19180c5b7d0fa62a045f4

To enable the server, set the following in /etc/gitlab/gitlab.rb:

gitlab_rails['env'] = {
  "PUMA_EXTERNAL_METRICS_SERVER" => "1"
}

puma['exporter_enabled'] = true
puma['exporter_log_enabled'] = true
puma['exporter_address'] = "127.0.0.1"
puma['exporter_port'] = 8083
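
Then apply the settings with the usual Omnibus commands:

sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart puma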

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #350548 (closed)

