Bring `sidekiq-cluster` script to Core

Problem Statement

Currently, running sidekiq in clustered mode (i.e. spawning more than one worker process) is technically only available as part of GitLab EE distributions, and for self-managed environments only in the Starter+ tiers. Because of that, when booting sidekiq in a development environment with the GDK, we assume the lowest common denominator, which is a single-process setup. This is a problem because the environment developers work in diverges from what actually runs in production (i.e. gitlab.com and higher-tier self-managed environments). We have already seen problems in production that are specific to a multi-process sidekiq setup and went unnoticed in development, such as race conditions between worker processes when initializing Prometheus, as well as initialization races leading to crashes.

We should consider:

Making sidekiq-cluster available as part of Core, running as a 1-process "cluster"
Using sidekiq-cluster in development to run sidekiq locally (e.g. with 2 processes)

This would include changing the background_jobs script we use to boot up sidekiq locally to utilize sidekiq-cluster.

That way we get better coverage of code that is actually used by the majority of GitLab deployments, particularly gitlab.com. I also see this being aligned with the recent change to focus on availability over velocity. An open question is how to restrict users who are not eligible to run more than 1 sidekiq process from actually doing so.
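One possible answer to the open question, as a minimal sketch only: clamp the requested process count before the cluster starts. Both `effective_process_count` and the `multi_process_allowed` flag are hypothetical names; how eligibility would actually be determined (e.g. a license or tier check) is left open here.

```ruby
# Hypothetical guard, not existing GitLab code.
# `multi_process_allowed` stands in for whatever eligibility check we settle on.
def effective_process_count(requested, multi_process_allowed:)
  # Ineligible installations always get a 1-process "cluster".
  return 1 unless multi_process_allowed

  # Eligible installations get what they asked for, but never fewer than 1.
  [requested, 1].max
end

effective_process_count(4, multi_process_allowed: false) # => 1
effective_process_count(4, multi_process_allowed: true)  # => 4
```

Clamping silently is only one option; failing loudly with an error message may be preferable so that operators notice the restriction.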

Reach

Personas:

Sasha (engineer) - because they get more confidence in making changes to sidekiq and testing them in an environment closer to production
Devon (devops) - because it simplifies how we operate sidekiq in any environment

Reach 10.0 = Impacts the vast majority (~80% or greater) of our users, prospects, or customers.

Impact

2.0 = High impact

Confidence

80% = Medium confidence

Effort

Medium to High. Below is a breakdown of what I think would need to happen, but we can roll this out in multiple stages.

Terms:

1P = single-process setup
nP = clustered setup

GitLab (the app)

Roughly in this order (all of this needs to happen):

Find a way to consolidate queue configuration between 1P and nP setups, since they use vastly different approaches currently
Start using sidekiq-cluster via GDK. This should be a relatively straightforward first step towards alignment and towards putting sidekiq-cluster on a hot path for ongoing development.
Revert changes in !11001 (merged) to move the script back under bin/
Rewrite bin/background_jobs to wrap bin/sidekiq-cluster
Update any remaining documentation if necessary
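To illustrate the wrapping step, here is a rough sketch of how a wrapper could assemble the command line. It assumes sidekiq-cluster takes one positional argument per worker process, each a comma-separated list of queue names; `cluster_command` is a hypothetical helper and the queue names are examples only.

```ruby
# Hypothetical helper, not existing GitLab code.
# Assumption: each positional argument to sidekiq-cluster describes one
# worker process as a comma-separated list of queues.
def cluster_command(queue_groups, script: 'bin/sidekiq-cluster')
  raise ArgumentError, 'at least one queue group is required' if queue_groups.empty?

  [script] + queue_groups.map { |group| group.join(',') }
end

# A 2-process development setup with example queue names:
cluster_command([%w[post_receive mailers], %w[pipeline_default]])
# => ["bin/sidekiq-cluster", "post_receive,mailers", "pipeline_default"]
```

A wrapper could then pass the resulting array to `Kernel#exec`, so the supervising process sees sidekiq-cluster directly. A 1P setup is simply the case where `queue_groups` has a single entry.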

Omnibus

I think these steps can happen incrementally, and only the first one is necessary for an MVC (minimum viable change):

Revert changes in omnibus-gitlab!3216 (merged)
Use sidekiq-cluster by default (it's currently disabled by default). This means that we should probably also provide sensible defaults for queue grouping, which we don't currently do; instead, we ask the user to fill in sidekiq_cluster['queue_groups'] manually. See also "consolidate queue config" above.
Deprecate or remove sv-sidekiq-run, i.e. 1P setups. If you want a single process, just create a cluster from 1 queue group. With a better approach to configuring sidekiq-cluster, this should be simple to do.
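As a sketch of what "sensible defaults" could look like, the helper below spreads a list of queues round-robin over a given number of processes and returns strings in the comma-separated form that sidekiq_cluster['queue_groups'] takes. `default_queue_groups` is a hypothetical name, round-robin is just one possible strategy, and the queue names are examples.

```ruby
# Hypothetical helper, not existing Omnibus code.
# Distributes queues round-robin over at most `process_count` groups.
def default_queue_groups(queues, process_count)
  raise ArgumentError, 'process_count must be positive' unless process_count.positive?
  return [] if queues.empty?

  # Never create more groups than there are queues, to avoid empty groups.
  groups = Array.new([process_count, queues.size].min) { [] }
  queues.each_with_index { |queue, index| groups[index % groups.size] << queue }
  groups.map { |group| group.join(',') }
end

queues = %w[post_receive mailers pipeline_default]

default_queue_groups(queues, 1) # => ["post_receive,mailers,pipeline_default"]
default_queue_groups(queues, 2) # => ["post_receive,pipeline_default", "mailers"]
```

With `process_count` set to 1, the result is a single queue group, which is the 1-process "cluster" that would replace the 1P setup.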

"From Source" installations

We also allow users to install and run GitLab from source. This comes with a significant amount of configuration and setup overhead for users, but it is an option we provide. These users would be affected by this change, because the init.d supervision script we provide uses the bin/background_jobs script, which under the current proposal would launch sidekiq-cluster instead. Whether this is a drop-in replacement depends on how much the background_jobs script has to change and whether additional variables need to be set, e.g. through the environment.

GDK

If we keep bin/background_jobs and simply have it point to sidekiq-cluster, then I don't think any changes are required here. Otherwise, we'd have to change the service run script to point directly to sidekiq-cluster.

Related Product issue: https://gitlab.com/gitlab-com/Product/issues/574

Edited May 31, 2022 by 🤖 GitLab Bot 🤖