Bring `sidekiq-cluster` script to Core
Currently, running sidekiq in clustered mode (i.e. spawning more than one worker process) is technically only available as part of GitLab EE distributions, and for self-managed environments only in the Starter+ tiers. Because of that, when booting sidekiq in a development environment with the GDK, the lowest common denominator is assumed, which is to run sidekiq in a single-process setup. That can be a problem, because the environment developers work in diverges from what actually runs in production (i.e. gitlab.com and higher-tier self-managed environments). We have already seen problems in production that went unnoticed in development because they are specific to a multi-process sidekiq setup, such as race conditions between worker processes when initializing prometheus, as well as initialization races leading to crashes.
We should consider:
- Make `sidekiq-cluster` available as part of Core and make it a 1-process "cluster"
- In development, use `sidekiq-cluster` to run sidekiq locally (e.g. with 2 processes)
  - This would include changing the `background_jobs` script we use to boot up sidekiq locally to utilize it
That way we get better coverage of code that is actually used by the majority of GitLab deployments, particularly gitlab.com. I also see this being aligned with the recent change to focus on availability over velocity. An open question is how to restrict users who are not eligible to run more than 1 sidekiq process from actually doing so.
- Sasha (engineer) - because they get more confidence in making changes to sidekiq and testing them in an environment closer to production
- Devon (devops) - because it simplifies how we operate sidekiq in any environment
Reach: 10.0 = Impacts the vast majority (~80% or greater) of our users, prospects, or customers.
Impact: 2.0 = High impact
Confidence: 80% = Medium confidence
Effort: Medium to High. Below is a breakdown of what I think would need to happen, but we can roll this out in multiple stages.
1P = single-process setup, nP = clustered setup
GitLab (the app)
Roughly in this order (all of this needs to happen):
- Find a way to consolidate queue configuration between 1P and nP setups, since they use vastly different approaches currently
- Start using `sidekiq-cluster` via GDK. This should be a relatively straightforward first step towards alignment and towards putting `sidekiq-cluster` on a hot path for ongoing development.
- Revert changes in !11001 (merged) to move the script back under
- Update any remaining documentation if necessary
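To make the "consolidate queue configuration" step more concrete, here is a minimal Ruby sketch of one possible shared representation. It assumes that `sidekiq-cluster` takes one positional argument per queue group, with the queues in a group separated by commas; the helper names (`cluster_args`, `single_process_groups`) and the queue names are hypothetical and do not come from the GitLab codebase.

```ruby
# Hypothetical sketch: one representation for both 1P and nP setups.
# A "queue group" is an array of queue names; each group maps to one
# sidekiq process. These helpers are illustrative, not GitLab APIs.

# Build the positional arguments for a sidekiq-cluster invocation,
# assuming one argument per group with comma-separated queue names.
def cluster_args(queue_groups)
  raise ArgumentError, 'at least one queue group is required' if queue_groups.empty?

  queue_groups.map { |group| group.join(',') }
end

# A single-process (1P) setup is simply a cluster with one group
# that contains every queue.
def single_process_groups(all_queues)
  [all_queues]
end

queues = %w[post_receive process_commit mailers]

one_p = cluster_args(single_process_groups(queues))
n_p   = cluster_args([%w[post_receive process_commit], %w[mailers]])

puts one_p.inspect
puts n_p.inspect
```

With a shape like this, the 1P case stops being a separate code path: it is the nP case with exactly one group.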
I think these steps can happen incrementally, and only the first one is necessary for an MVC (minimum viable change):
- Revert changes in omnibus-gitlab!3216 (merged)
- Enable `sidekiq-cluster` by default (it's currently disabled by default). This means that we should probably also provide sensible defaults for queue grouping, which we don't currently. We ask the user to manually fill in `sidekiq_cluster['queue_groups']` instead. See also "consolidate queue config" above.
- Deprecate or remove `sv-sidekiq-run`, i.e. 1P setups. If you want a single process, just create a cluster from 1 queue group. With a better approach to configuring sidekiq-cluster, this should be simple to do.
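As an illustration of what "sensible defaults for queue grouping" could look like, the following Ruby sketch spreads a list of queues round-robin over a given number of processes. It assumes that `sidekiq_cluster['queue_groups']` holds an array of comma-separated queue strings; `default_queue_groups` and the queue names are hypothetical, and a real default would likely weigh queues by load rather than by position.

```ruby
# Hypothetical helper: distribute queues round-robin over a fixed number
# of processes and return one "queue_a,queue_b" string per process.
# The output shape is an assumption about sidekiq_cluster['queue_groups'].
def default_queue_groups(queues, process_count)
  raise ArgumentError, 'process_count must be positive' unless process_count.positive?

  # Never create more groups than there are queues.
  count = [process_count, queues.size].min
  return [] if count.zero?

  groups = Array.new(count) { [] }
  queues.each_with_index { |queue, index| groups[index % count] << queue }
  groups.map { |group| group.join(',') }
end

# Illustrative queue names only.
queues = %w[post_receive mailers pipeline_default project_export]

two_groups = default_queue_groups(queues, 2)
one_group  = default_queue_groups(queues, 1)

puts two_groups.inspect
puts one_group.inspect
```

Calling the helper with a process count of 1 yields a single group containing every queue, which is the "cluster from 1 queue group" case described above.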
"From Source" installations
We also allow users to install and run GitLab from source. This comes with a significant amount of configuration and setup overhead for users, but it is an option we provide. These users would be affected by this change, because they use the `bin/background_jobs` script as part of an `init.d` supervision script we provide, and which under the current proposal would now launch `sidekiq-cluster` instead. This may or may not be a drop-in replacement, depending on how much the `background_jobs` script will have to change and/or if additional variables need to be set e.g. through the environment.
If we keep `bin/background_jobs` to simply point to `sidekiq-cluster`, then I don't think any changes are required here. Otherwise we'd have to change the service run script to point directly to `sidekiq-cluster`.
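One way the launcher could stay a drop-in replacement is to fall back to a single queue group when nothing is configured through the environment. The Ruby sketch below illustrates that idea only: `SIDEKIQ_QUEUE_GROUPS` is a made-up variable name, not an existing GitLab setting, and `queue_groups_from_env` is a hypothetical helper.

```ruby
# Hypothetical sketch: derive queue groups from the environment.
# SIDEKIQ_QUEUE_GROUPS is an invented variable name for illustration.
# Its assumed format is whitespace-separated groups, each group being a
# comma-separated list of queues, e.g. "default,mailers post_receive".
def queue_groups_from_env(env, all_queues)
  raw = env.fetch('SIDEKIQ_QUEUE_GROUPS', '').strip

  # Nothing configured: behave like today's single-process setup by
  # putting every queue into one group.
  return [all_queues.join(',')] if raw.empty?

  raw.split(/\s+/)
end

all_queues = %w[default mailers]

default_groups = queue_groups_from_env({}, all_queues)
custom_groups  = queue_groups_from_env({ 'SIDEKIQ_QUEUE_GROUPS' => 'default mailers' }, all_queues)

puts default_groups.inspect
puts custom_groups.inspect
```

Under this assumption, existing from-source installations that set no extra variables would keep running one process, while opting in to more processes would only require setting the variable.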
Related Product issue: https://gitlab.com/gitlab-com/Product/issues/574