Skip to content

Clean up 7_prometheus initializer

Matthias Käppler requested to merge 350548-clean-up-prometheus-initializer into master

What does this MR do and why?

Before you read!

We ended up moving this behind a FF: prometheus_initializer_refactor

The FF defaults to ON; since it is on the boot path, we cannot immediately affect its state, since either way on or off, GitLab pods would have to restart. We therefore decided to let it default to ON, so that the change goes live together with the commit, rather than surprisingly on a subsequent deploy.

Turning the flag off and restarting the app should resolve any potential issues introduced.

Rollout issue: #353446 (closed)

Overview

This is step 1 in what I expect to be 2 MRs eventually.

Our Prometheus metrics initializer 7_prometheus is difficult to understand and maintain, due its overuse of nested if-else logic. As part of #350548 (closed) I need to add even more logic to this in order to start a server process in the Puma primary, so I reckoned it was about time to give this initializer some much needed TLC first.

This initializer's primary complexity stems from it being 2-dimensional in configuration:

  1. Some actions should only fire on particular runtimes (Puma vs Sidekiq).
  2. Some actions should only fire on particular lifecycle-hooks of these runtimes.

I plan to improve on the status quo in two steps:

  1. This MR: Co-locate and simplify related initialization switches such as those that need to fire for the same runtime or life-cycle hooks.
  2. Next MR: Extract code into strategy objects, such that each strategy implementation contains all logic related to a particular runtime (e.g. SidekiqPrometheusInit and PumaPrometheusInit). We only need to switch these once, and these will always be mutually exclusive so would benefit from being defined in separate code modules. Then, these strategy objects will implement all life-cycle hooks they care about if necessary. The initializer will then instantiate the appropriate strategy instance and invoke them.

Change details

The diff is a bit noisy, but I tried to mostly maintain semantics with some minor exceptions that I consider improvements. Here is what I changed:

  • Instead of wrapping the majority of the logic into an if block, I return early now unless:
    1. We are running in a non-test application runtime (see below).
    2. Prometheus metrics are enabled.
  • Instead of just returning in test envs, I extended skipped environments to those that are not actually "application runtimes" An application runtime is either Puma or Sidekiq at the moment, but not, for instance, RSpec, Rails Console, rake, etc. I formalized this as a new method on Runtime. This is an improvement because it does not start samplers and metrics servers when launching the console or rake tasks anymore, which makes no sense.
  • I removed the Runtime.web_server? method. We had introduced this during the migration from Unicorn to Puma, since we ran both side by side and needed some checks to execute on either runtime. The initializer was using a mix of both, so I simplified it to just check for puma? and removed web_server?.
  • I moved some Puma specific logic from life-cycle hooks that executed at the bottom of the file to places where we were already executing in the same combination of "runtime and hook"; these were all calls related to launching or stopping a metrics server thread in the Puma primary. These will go away once we finish #350548 (closed).
  • I added code comments where I thought it would help understand why something was done.

How to set up and validate locally

The main test here should be for: do not regress on anything. Automated tests will verify that we are not crashing in the initializer due to the changes I made.

As for metrics, you can convince yourself they are still recorded and exported by enabling the metrics server for either Puma or Sidekiq:

We can also see that samplers are not started anymore when running e.g. the console:

$ bin/rails c
...
Loading development environment (Rails 6.1.4.6)
[1] pry(main)> Thread.list
=> [#<Thread:0x00007fbb56baede8 run>, #<Thread:0x00007fbb4ea37ad0 /data/cache/bundle-2.7.5/ruby/2.7.0/gems/activerecord-6.1.4.6/lib/active_record/connection_adapters/abstract/connection_pool.rb:323 sleep>]

Previously, this was listing several daemon threads that were metrics *Samplers.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #350548 (closed)

Edited by Matthias Käppler

Merge request reports