Export Sidekiq metrics from separate process
## Problem

We currently both collect and emit Prometheus metrics in Sidekiq from the Sidekiq worker processes themselves. This is because there is no primary process in Sidekiq: when running `sidekiq-cluster`, as we now do everywhere, all workers are created equal, and all workers go through the same start-up logic, including initializers that register logic to collect and emit metrics.

For _collecting_ metrics this is fine, as they will be written to process-specific database files. For _emitting_ metrics this is not fine, since only one such process will succeed in binding to the configured exporter port, whereas all remaining processes will attempt and [fail to bind to the same port](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb#L32). The result is a race condition and a random election of which worker will be scraped by Prometheus.

This behavior has resulted in data loss in the past, where workers would accidentally remove each other's database files (https://gitlab.com/gitlab-org/gitlab/-/issues/37387, https://gitlab.com/gitlab-org/gitlab/-/issues/336311). Moreover, serving metrics from the worker itself can be costly and lead to CPU contention, since the metrics server competes for CPU with the worker threads, which is especially problematic for CPU-bound Sidekiq jobs.

## Proposal

The two main goals we are trying to accomplish:

1. **Move metrics exporter logic to a separate process.** With `sidekiq-cluster` we run a lightweight parent process that supervises all Sidekiq workers. Instead of workers racing to bind the exporter port, we should consider moving [`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb) into a new process, which will be supervised by `sidekiq-cluster` just as workers are now. This will provide better fault and resource isolation.
1. **Split health-check endpoints from metrics endpoints.** This will provide better separation of concerns and will allow us to decide upfront which Sidekiq worker serves health-checks, rather than relying on a race to allocate a port.

This should also fix https://gitlab.com/gitlab-org/gitlab/-/issues/5714 where workers deleted each other's metrics.

![sidekiq-cluster_1.resized](/uploads/78bfa941dc98833c4fc29dddfce39a68/sidekiq-cluster_1.resized.jpg)

## Rollout plan

### Now

- [x] Split health-check and metrics settings keys; this will allow us to keep an in-process server running in a Sidekiq worker to serve health-checks, while using a separate set of settings to configure metrics, which will require allocating a different port
  - [x] App-level split: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/74875
  - [x] Omnibus split: https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5743
  - [x] Charts split: https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2272
- [x] Run the new metrics server (spawned from the `sidekiq-cluster` parent process) alongside the existing in-worker server; it will export the same metrics as the server we start on a random worker. It is disabled by default. Relevant issues:
  - Splitting the server out of the worker: https://gitlab.com/gitlab-org/gitlab/-/issues/345887
  - Enabling the server using a code switch: https://gitlab.com/gitlab-org/gitlab/-/issues/346256
- [x] In SaaS configmaps, set the `port` value for health-checks to a different value than for the exporter; this will make the server process start up and connect to a different port than the in-worker server.
- [x] Point k8s to the new health-check endpoint; this will shift scraper traffic from the worker endpoint to the new server endpoint.
- [x] Let it run for a few days to see whether this has any impact on node resource utilization or stability

### Future

Since existing users might still have their config set up in a way that expects both metrics and health-checks to be served from the same port, we cannot immediately yank endpoints from either Sidekiq workers or the new metrics server. We will instead announce that using `sidekiq_exporter` settings to configure Sidekiq health checks is now deprecated, give users 3 months or so to adjust and use the new settings keys, and then:

- [x] Move `/liveness` and `/readiness` endpoints from `SidekiqExporter` into a separate `HealthChecks` code module. Sidekiq workers will only serve those now: https://gitlab.com/gitlab-org/gitlab/-/issues/345804
- [x] Remove any left-overs or unnecessary workarounds, such as wrapper scripts cleaning up the metrics directory:
  - https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6481

<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->

*This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*

<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
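
For readers who want to see the port race from the Problem section in isolation, here is a minimal Ruby sketch. It is not GitLab code: `try_bind` and `simulate_bind_race` are hypothetical helpers, the "workers" are simulated inside one process rather than forked by `sidekiq-cluster`, and the port is chosen by the OS instead of being read from the exporter settings. It only illustrates that when several identical start-up paths try to listen on the same port, one succeeds and the rest get `Errno::EADDRINUSE`.

```ruby
require "socket"

# Hypothetical helper: try to start a listener on the given port, as each
# worker's initializer would. Returns the server socket, or nil if the port
# is already owned by someone else.
def try_bind(port, host = "127.0.0.1")
  TCPServer.new(host, port)
rescue Errno::EADDRINUSE
  nil
end

# Hypothetical helper: simulate several identical workers running the same
# start-up logic. Returns a hash of worker name => whether it bound the port.
def simulate_bind_race(worker_names)
  # Ask the OS for a free port (port 0), then release it so the simulated
  # workers can compete for it. This avoids colliding with real services.
  probe = TCPServer.new("127.0.0.1", 0)
  port = probe.addr[1]
  probe.close

  servers = worker_names.map { |name| [name, try_bind(port)] }
  servers.map { |name, server| [name, !server.nil?] }.to_h
ensure
  # Close whichever listener was opened so the port is released again.
  servers&.each { |_, server| server&.close }
end

result = simulate_bind_race(%w[worker-0 worker-1 worker-2])
result.each do |name, bound|
  puts "#{name}: #{bound ? 'serving metrics' : 'failed to bind'}"
end
```

In this sequential sketch the first simulated worker always wins. In a real cluster the winner depends on start-up timing, which is why Prometheus ends up scraping an effectively random worker. Giving the exporter port to one dedicated, supervised process removes the competition entirely.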