Skip to content

Fix missing metrics for Sidekiq exporter server

What does this MR do and why?

Fixes missing sidekiq_exporter_ruby_* metrics for the Sidekiq exporter server process.

We were accidentally deleting metrics the exporter server exported about itself. Actual sidekiq worker metrics were not affected, nor are request metrics for this server. Furthermore, this only happens when running a dedicated metrics server process for Sidekiq, which is not yet the default, but is the case for SaaS since we're dog-fooding it.

This happened because we would wipe the metrics dir whenever sidekiq_0 starts up, but this happens after the metrics server starts, so the worker was deleting those existing metrics.

The fix moves the cleanup logic out of the shared 7_prometheus initializer and into the CLI wrapper script for Sidekiq. This is now possible to do because we started using this in SaaS, despite there only being a single Sidekiq worker. We can think of this script as the "Sidekiq supervisor", so any work that affects all Sidekiq processes should happen here, prior to starting them up.

I had to refactor CleanupMultiprocDirService slightly to take the target directory as an argument since otherwise, initialization might enter a race with Prometheus::Client being configured first, which isn't always the case.

Screenshots or screen recordings

Screenshot_from_2022-03-22_13-57-00

How to set up and validate locally

  1. Enable the sidekiq_exporter (it's a monitoring settings)
  2. Make sure sidekiq_exporter and sidekiq_health_checks are not sharing a port (since this will start the metrics server in legacy mode, and the bug will not surface.)
  3. Start Sidekiq (via sidekiq-cluster)
  4. A metrics server should appear as the sidekiq-exporter process
  5. curl its /metrics endpoint
  6. You should see metrics starting in sidekiq_exporter_* (these were previously AWOL)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #356505 (closed)

Edited by Matthias Käppler

Merge request reports