Consolidate & externalize all sidekiq process supervision
Need
Supervision of our sidekiq processes, such as automatically shutting them down when they behave abnormally, is currently both fragmented and inconsistently implemented. There are two aspects I see as problematic:
**Internal vs. external supervision.** We currently rely on a mixture of process-internal and process-external approaches to monitor sidekiq. For instance, we use `runit` supervisors to monitor sidekiq services and restart them should they shut down unexpectedly. This is a form of external supervision, since `runit` lives outside of the GitLab app. For killing sidekiq processes, however, we rely mostly on internal supervision, either via threads running inside the supervised worker processes, or as part of a parent process that started them (more on that below). I think internal supervision is not a good approach to system monitoring, because the monitor can only ever be as healthy as the process it is monitoring. There is some evidence that this is affecting a premium customer, where a sidekiq process became unresponsive, including its memory killer thread.
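To illustrate the failure mode, here is a minimal sketch of the process-internal pattern; the 2 GB limit and 10-second interval are hypothetical, and reading `/proc` is Linux-specific. This is not GitLab's actual `MemoryKiller` code:

```ruby
# A daemon thread inside the monitored process watches its own RSS.
Thread.new do
  loop do
    rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i

    # If the process wedges (e.g. stuck in a C extension), this thread
    # may never be scheduled again: the monitor is exactly as healthy
    # as the process it monitors.
    Process.kill("TERM", Process.pid) if rss_kb > 2 * 1024 * 1024

    sleep 10
  end
end
```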
**Co-location of supervision.** We currently spread monitoring concerns across at least three different locations:
- `MemoryKiller`: a daemon that runs alongside worker threads in each sidekiq process and sends SIGTERM to its parent process when a certain maximum RSS is exceeded.
- `sidekiq-cluster` CLI: the process responsible for spawning sidekiq worker processes in a clustered setup; it runs a control loop in a thread that monitors worker processes for health, and if one goes away, it shuts down all of them.
- `sidekiq-cluster` initializer: a Rails initializer script that starts a thread which watches for its parent PID to change; if it does, the worker kills itself. This is to account for zombies escaping when the cluster parent is SIGKILL'ed, e.g. by `runsv`, the supervisor itself. (A sketch of this check follows the list.)
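To make the third mechanism concrete, here is a minimal sketch of such a parent-PID watcher; the names and the 5-second interval are assumptions for illustration, not the actual initializer code:

```ruby
# Remember who our parent is at boot time (the sidekiq-cluster CLI).
expected_ppid = Process.ppid

Thread.new do
  loop do
    # If the cluster parent was SIGKILL'ed, this worker is re-parented
    # (typically to PID 1) and its ppid changes; terminate ourselves
    # rather than escaping supervision as a zombie.
    if Process.ppid != expected_ppid
      Process.kill("TERM", Process.pid)
      break
    end

    sleep 5
  end
end
```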
This makes the setup difficult to understand and maintain, and makes monitoring itself unreliable, since it can only ever be as reliable as the process running it. There is also the suspicion that with CPU-heavy sidekiq workloads, the GIL might not give the monitoring threads enough CPU time to operate reliably (although that might be fixable by setting thread priority hints, as sketched below). Conversely, due to the GIL, any CPU cycles spent in monitoring threads are cycles we necessarily steal from worker threads, i.e. application workloads running in the same process. Neither problem would occur if we ran monitors in a separate process, as that process can utilize other cores.
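As an aside, Ruby does expose such a hint via `Thread#priority=`; whether that is enough under a saturated GIL is untested here:

```ruby
# Thread#priority is only a hint to Ruby's thread scheduler: higher
# values are scheduled more eagerly than the default of 0.
monitor = Thread.new do
  loop do
    # ... RSS / health checks would go here ...
    sleep 10
  end
end
monitor.priority = 2
```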
Approach
I propose to consolidate all sidekiq supervision concerns (outside of `runit`) into a single process that runs parallel to sidekiq. This new process would:
- run a single control loop in a single thread, which:
  - monitors process RSS for all sidekiq workers
  - monitors cluster health
  - monitors zombiness
- issue SIGTERM/SIGKILL in case any of the monitored conditions is violated (see the sketch after this list)
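A minimal sketch of what such a loop could look like, assuming the worker PIDs are handed to the supervisor on the command line; the RSS limit, grace period, and poll interval are hypothetical, and `/proc` makes it Linux-only. This is the shape of the idea, not a claim about the final implementation:

```ruby
MAX_RSS_KB = 2 * 1024 * 1024 # hypothetical 2 GB per-worker limit

worker_pids = ARGV.map(&:to_i)

def rss_kb(pid)
  # nil.to_i is 0, so a missing VmRSS line degrades gracefully.
  File.read("/proc/#{pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
rescue Errno::ENOENT
  0 # process is already gone
end

def alive?(pid)
  Process.kill(0, pid) # signal 0 only probes for existence
  true
rescue Errno::ESRCH
  false
end

def shut_down(pids)
  pids.each { |pid| Process.kill("TERM", pid) rescue nil }
  sleep 30 # hypothetical grace period before escalating
  pids.each { |pid| Process.kill("KILL", pid) if alive?(pid) }
end

loop do
  # Cluster health: if any worker died, take all of them down,
  # mirroring sidekiq-cluster's current all-or-nothing behaviour.
  unless worker_pids.all? { |pid| alive?(pid) }
    shut_down(worker_pids)
    break
  end

  # Memory: snipe individual workers exceeding the RSS threshold.
  worker_pids.each do |pid|
    Process.kill("TERM", pid) if rss_kb(pid) > MAX_RSS_KB
  end

  # Zombiness: if the supervisor itself was re-parented to init, the
  # cluster parent is gone; shut everything down instead of orphaning.
  if Process.ppid == 1
    shut_down(worker_pids)
    break
  end

  sleep 5
end
```

Because this loop runs in its own process, a wedged worker cannot take the monitor down with it, and the monitoring work competes for other cores instead of for the workers' GIL.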
Potential process tree:
- `runsv sidekiq-cluster`
  - `sidekiq-cluster` (CLI) `<pgroup>`
    - `sidekiq-worker-0`
    - `sidekiq-worker-1`
    - `sidekiq-sv`

where `sidekiq-sv` is the new supervisor. Alternatively, the supervisor could run directly under `runsv` as a separate service.
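For illustration, a sketch of how the cluster CLI could set up this tree; `sidekiq-sv` is the hypothetical supervisor binary, and the worker command lines are placeholders as well:

```ruby
# Become the leader of a new process group; children inherit it, so a
# single group-wide signal can reach workers and supervisor alike.
Process.setpgrp

worker_pids = 2.times.map do |i|
  Process.spawn("bundle", "exec", "sidekiq", "-q", "queue_#{i}")
end

# Hand the worker PIDs to the supervisor so it knows what to watch.
Process.spawn("sidekiq-sv", *worker_pids.map(&:to_s))

# During shutdown, a negative PID signals the entire group:
#   Process.kill("TERM", -Process.getpgrp)

Process.waitall
```

Keeping everyone in one process group is what makes the `<pgroup>` annotation in the tree useful: one signal tears down the whole cluster, supervisor included.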
Benefits
The approach outlined above has two major benefits:
- Co-location of all concerns related to sidekiq process health, making the code involved easier to understand and maintain
- A monitor life-cycle that is detached from the individual processes being monitored, making monitoring more reliable and less susceptible to knock-on effects from those processes choking on something
Competition
Move all supervision logic into `sidekiq-cluster` itself
An alternative approach could be to simply co-locate all existing supervision features in the `sidekiq-cluster` script itself, since it already runs a control loop that monitors sidekiq workers; this might be a smaller increment than the solution proposed above.
A drawback of the externalized approaches described above is that the supervisor will not have the same level of insight into process internals as some of the current daemon threads do. For instance, `MemoryKiller` collects process-internal data and submits it to Prometheus. I personally think (and had commented on the respective MR as well) that this is actually a separate concern from out-of-memory sniping of processes anyway, and should probably move out of that class.
Rely on Kubernetes and pod management for process supervision
A longer-term goal is to make GitLab run on Kubernetes (k8s). Since k8s operates on pods, not processes, the pod is the natural "unit under supervision", and at least some supervising functions that operate on the process level become obsolete. For instance, a sidekiq cluster (logically speaking) would turn from "physical node with N processes" into "node(s) with N pods", where each pod runs a single-process sidekiq instance. That would drastically reduce the amount of process management we have to do ourselves and leave the rest to k8s.
One problem is that it is unlikely we will be able to target all platforms and customers with a k8s-based deployment. For instance, small development teams running a single-node Omnibus installation, or low-resource platforms such as the Raspberry Pi, would not benefit from these improvements, since they are unlikely to run k8s. We might have to decide on a case-by-case basis which supervision functions are absolutely mandatory under all circumstances; those might have to continue to live on the process level.