Separate Sidekiq metrics and health-checks server (!85815) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 345804-extract-sk-health-checks into master Apr 25, 2022

What does this MR do and why?

Prior to 15.0, metrics and health-checks endpoints for Sidekiq were coupled in a single server.

This is problematic since we are looking to serve metrics out-of-process, but health checks must be served from the process that is being observed. It has also led to confusing bugs, since we have a UI toggle that turns both off together, and users were confused when they enabled both the metrics exporter and health checks in gitlab.yml but had the toggle turned off, which led to Kubernetes thinking Sidekiq was down because the health check started to fail.

So we decided to split these endpoints (/metrics for metrics and /readiness and /liveness for health checks) apart.

We had already separated settings keys for these two concerns previously, but internally unless two separate ports were set, it would silently still serve both from one server. In this MR, we are doing the actual code split, which is a breaking change. You must configure separate ports now, and we will start separate servers for each.

Related to #345804 (closed)
Epic: &6409 (closed)

NOTE: Since there is logic in Omnibus that still assumes there is only a single server for all these endpoints, we should merge omnibus-gitlab!6065 (merged) first. The respective Helm changes are already done.

NOTE: For SaaS this will be a no-op semantically since the server split at the config level has already happened. It will now just remove the mutually unused endpoints from each respective server.

Implementation

To understand these changes, I should explain the following:

Previously, both Sidekiq metrics and health checks were served by a class called BaseExporter, which is subclassed to SidekiqExporter for Sidekiq (and WebExporter for Puma but it is not affected by this change). As the name suggests, its original intention was to serve Prometheus metrics.
BaseExporter starts a WEBrick server instance that listens on a background thread. Until now, it included a HealthCheckMiddleware to also service health checks.
With this MR, I removed this middleware from BaseExporter, so all *Exporters now only register /metrics. Note that this doesn't affect Puma because it uses health_controller.rb for health checking instead, not the exporters.
I moved this middleware verbatim into lib/health_checks/middleware.rb.
I added lib/health_checks/server.rb which uses this middleware. The server implementation is very similar to BaseExporter but somewhat simpler. I decided not to share code, because we are actually looking to replace the *Exporter completely with a Golang implementation, but that is still in development in parallel. But expect that code to go away soon.
The Sidekiq health check server is started in the sidekiq.rb initializer (it was previously in 7_prometheus when it was still part of the metrics exporter class.) This also fixes the problem that its launch was accidentally guarded by a check for prometheus_metrics_enabled?, which is controlled by the aforementioned UI toggle.
Because defaulting the server settings to the exporter is not safe anymore (since it would result in port clashes), I also changed 1_settings.rb to not do this anymore.
There was some logic in the sidekiq-cluster process wrapper that would only start an external metrics server when its port does not clash with the health check server. Since we default them to separate ports now, and since we also add logic to Omnibus and Helm to flag port clashes during configuration, we don't need this anymore.

How to set up and validate locally

In gitlab.yml, Sidekiq health-checks must be enabled. Find the monitoring section to enable them:

  ## Monitoring
  # Built in monitoring settings
  monitoring:
    sidekiq_health_checks:
      enabled: true

The default port is 8092 but you can change it.

Next, start Sidekiq and wait for it to fully boot. To verify the server is running, just curl one of the health endpoints, e.g.:

$ curl localhost:8092/readiness

It should respond with 200 OK and a "success" JSON message.

You can also verify that the metrics server should not serve health checks anymore. If enabled, it runs on 8082 by default. It should now only serve /metrics.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited May 04, 2022 by Matthias Käppler

Separate Sidekiq metrics and health-checks server

What does this MR do and why?

Implementation

How to set up and validate locally

MR acceptance checklist

Merge request reports