Break out Sidekiq process supervision into library module

Matthias Käppler requested to merge 350548-process-supervision-module into master

What does this MR do and why?

In &6409 (closed) we extracted from sidekiq-cluster a server process that serves metrics to Prometheus. We are looking to do the same for Puma, where the metrics server still runs inside the Puma primary process: &7304 (closed).

When we run a separate server process, it has its own life-cycle and must be kept alive for as long as Sidekiq or Puma are alive. For sidekiq-cluster we already had a mechanism for process supervision in the sidekiq-cluster script itself, since it has to observe potentially multiple Sidekiq worker processes. This script exists because there is no primary Sidekiq process to fork from (Sidekiq is not a pre-fork system; we just start it several times if necessary).
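
Supervision here boils down to periodically polling whether each observed PID is still alive. Purely for illustration, a minimal liveness probe in Ruby could look like the sketch below; the name process_alive? is illustrative and not taken from this MR:

    # Probe a PID without delivering an actual signal: signal 0 only
    # performs the existence/permission check.
    def process_alive?(pid)
      Process.kill(0, pid)
      true
    rescue Errno::ESRCH   # no such process
      false
    rescue Errno::EPERM   # process exists but belongs to another user
      true
    end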

With Puma this is not the case. It is a pre-fork server, and there is a dedicated primary that observes worker processes. However, it would be unaware of the metrics server process, so we need a different supervisor to handle this.

With this MR I decided to extract the existing supervision logic from sidekiq-cluster into a library module, ProcessSupervisor, which both sidekiq-cluster (now) and Puma (soon) can use to monitor their "process herd" and act on some or all observed processes should they go away.
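
To make the intended shape concrete, below is a hedged usage sketch; the constructor argument, the callback contract, and the helpers start_sidekiq_workers and start_metrics_server are all illustrative assumptions, not necessarily the final API of this MR:

    # Hypothetical usage sketch of the supervision module.
    worker_pids = start_sidekiq_workers        # illustrative helper
    metrics_server_pid = start_metrics_server  # illustrative helper

    supervisor = ProcessSupervisor.new(check_interval_seconds: 1)

    supervisor.supervise(worker_pids + [metrics_server_pid]) do |dead_pids|
      if dead_pids == [metrics_server_pid]
        # Only the metrics server died: restart it, keep the cluster running.
        metrics_server_pid = start_metrics_server
      else
        # A worker died: take the whole process herd down.
        supervisor.shutdown
      end

      # Hand back the set of PIDs the supervisor should keep watching.
      (worker_pids + [metrics_server_pid]) - dead_pids
    end

This mirrors the two behaviors validated below: a dead metrics server is restarted, while a dead worker brings everything down.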

A non-goal of this MR is to use this logic outside of sidekiq-cluster just yet; for now, it is purely a refactor.

Notes to reviewers

  • I wasn't able to feature-flag this because sidekiq-cluster does not have access to the database; it is a plain old Ruby script that serves as a container entrypoint. So unfortunately we need to rely fully on manual testing, automated testing, and pre-production checks here.
  • I did not see a way to break this MR up further; moving just some code out of CLI would not have been helpful as an intermediate step, and it would have made testing harder.

How to set up and validate locally

There are many different scenarios we could test for, but the primary ones verify that sidekiq-cluster will:

  1. Restart the metrics server should it die (and as long as the cluster isn't currently shutting down).
  2. Terminate all processes, workers and metrics server alike, if any of the workers die (but not vice versa; we never want the metrics server to induce worker death!)

First scenario: metrics-server goes away

  1. Make sure sidekiq_exporter is enabled and configured (see the example configuration after this list): https://docs.gitlab.com/ee/administration/sidekiq.html#configure-the-sidekiq-metrics-server
  2. Start Sidekiq (via the sidekiq-cluster script, not via bundle exec)
  3. You should see a message in the Sidekiq logs like this:
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:29:05.853Z","message":"Starting cluster with 2 processes"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:29:05.871Z","message":"Starting metrics server on port 3807"}
  4. Kill the metrics server:
    $ kill $(pidof sidekiq_exporter)        
  5. You should see:
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:31:25.998Z","message":"Metrics server terminated, restarting..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:31:25.998Z","message":"Starting metrics server on port 3807"}
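
For reference, step 1 amounts to a gitlab.yml section roughly like the one below; consult the linked documentation for the authoritative settings (port 3807 matches the log lines above):

    # gitlab.yml (excerpt); values shown are illustrative defaults
    monitoring:
      sidekiq_exporter:
        enabled: true
        address: localhost
        port: 3807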

Second scenario: worker goes away

  1. Start Sidekiq (via the sidekiq-cluster script, not via bundle exec)
  2. You should see a message in the Sidekiq logs like this:
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:29:05.853Z","message":"Starting cluster with 2 processes"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:29:05.871Z","message":"Starting metrics server on port 3807"}
  3. Kill one of the workers:
    $ pgrep -f 'sidekiq 6.4.0'
    72
    74
    $ kill 74        
  4. You should see something like:
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:24.859Z","message":"Shutting down"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:24.860Z","message":"Scheduler exiting..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:24.860Z","message":"Terminating quiet workers"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:24.860Z","message":"Scheduler exiting..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:25.361Z","message":"Pausing to allow workers to finish..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:26.864Z","message":"Bye!"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.174Z","message":"A worker terminated, shutting down the cluster"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.175Z","message":"Shutting down"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.175Z","message":"Scheduler exiting..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.175Z","message":"Terminating quiet workers"}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.175Z","message":"Scheduler exiting..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:31.677Z","message":"Pausing to allow workers to finish..."}
    sidekiq_1           | {"severity":"INFO","time":"2022-02-17T14:34:33.179Z","message":"Bye!"}

At this point, nothing should be running anymore.
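
To double-check, the same commands used above should now come back empty:

    $ pgrep -f 'sidekiq 6.4.0'    # no output: all workers are gone
    $ pidof sidekiq_exporter      # no output: the metrics server is gone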

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #350548 (closed)
