Document GitLab's Sidekiq setup
As you're probably all aware, we're using Sidekiq to handle long-running workloads at GitLab. A good example of such a workload is an email notification that needs to be dispatched in response to a user action. By moving these long-running actions off the request/response path, we can keep latency low.
I hadn't used Sidekiq before coming to GitLab, and while it appears in different sections of the GitLab docs, it wasn't covered in enough detail for me to really understand how it is used and deployed at GitLab. Moreover, the official Sidekiq docs are quite sparse. I therefore created this issue from personal notes documenting its mode of operation in both the development (local) and gitlab.com environments. Maybe it's useful for other people too.
Sidekiq conceptually
Sidekiq is based on 3 main concepts: workers, jobs, and queues. A worker is a description of an application workload you want to run, such as dispatching an email, encoded as a Ruby class that mixes in the `Sidekiq::Worker` module. Unfortunately, Sidekiq isn't even internally consistent with its nomenclature(*), as it also calls a thread executing that code a worker. Hereafter I will use worker to refer to the thread running the code defined in the worker class. Moving on. A job is a request to Sidekiq to have a worker instantiate and execute the worker class. So for any one worker class, there can be many jobs executing its instances (e.g. emails sent). Jobs are posted to shared queues, from which Sidekiq reads and assigns jobs to the next available worker.
(*) Unfortunately, in the language we have adopted at GitLab, the term 'worker' is usually used to refer to a particular Sidekiq process executing worker threads, which makes this extra confusing!
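The three concepts can be sketched with a toy example. Note this is a stand-in illustration, not the real Sidekiq API: the class, queue, and payload shape here are all made up for the sake of showing how a worker class, a job, and a queue relate.

```ruby
require "json"

# Toy illustration (not real Sidekiq): a worker class describes the work,
# a job is a serialized request to run it, and a queue holds pending jobs.
class EmailWorker
  def perform(recipient)
    "sent email to #{recipient}" # stand-in for the real work
  end
end

queue = [] # stand-in for a Redis-backed job queue

# Enqueuing a job: record which worker class to run, and with which arguments.
queue.push(JSON.generate("class" => "EmailWorker", "args" => ["alice@example.com"]))

# A worker thread picks up the job, instantiates the worker class, and runs it.
job = JSON.parse(queue.shift)
result = Object.const_get(job["class"]).new.perform(*job["args"])
puts result # => sent email to alice@example.com
```

Many jobs can reference the same worker class, which is exactly the "one class, many executing instances" relationship described above.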
Sidekiq's execution model
At GitLab we run Sidekiq in different modes: as a single process in development (local), and multi-process (also called "clustered") in production (and as part of an Omnibus installation, IIUC). Each Sidekiq process operates a thread pool, with each thread representing a worker that gets work assigned from a job. The number of threads is also referred to as the concurrency level at which a Sidekiq process operates. Especially in a multi-process setup, you might wonder how jobs are coordinated between the application enqueuing them and Sidekiq assigning them to workers, or between multiple worker processes reading from a shared queue. This is achieved by storing job queues in Redis, so that multiple Sidekiq processes can read from a single queue representation and distribute the workload evenly. This is also why job parameters must be serializable: jobs are stored in Redis as JSON.
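The execution model above can be simulated with plain Ruby threads: one shared queue of JSON-serialized jobs and a pool of worker threads pulling from it. This is a local sketch only; real Sidekiq uses a Redis list rather than a `Thread::Queue`, and the payload shape here is invented.

```ruby
require "json"

# Simulated thread pool: CONCURRENCY worker threads share one job queue.
CONCURRENCY = 3
jobs = Thread::Queue.new

# Enqueue some jobs; parameters must survive JSON serialization.
5.times { |i| jobs << JSON.generate("args" => [i]) }
CONCURRENCY.times { jobs << :shutdown } # one sentinel per worker thread

results = Thread::Queue.new
workers = CONCURRENCY.times.map do
  Thread.new do
    # Each worker thread pulls the next available job from the shared queue.
    while (job = jobs.pop) != :shutdown
      results << JSON.parse(job)["args"].first * 2 # stand-in for real work
    end
  end
end
workers.each(&:join)

p results.size # => 5
```

Because every worker reads from the same queue, jobs are distributed to whichever thread is free next, which is the same load-balancing behaviour multiple Sidekiq processes get by sharing a Redis queue.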
Sidekiq GitLab setup
At GitLab we have dozens of queues; they are defined in `config/sidekiq_queues.yml` in the main Rails repo. The way we assign workers to process jobs in these queues differs between the development and production environments and is further described below.
GDK / development setup
When running GitLab locally with the GDK, Sidekiq will run by default as a single process with 10 worker threads. We can see this when running `gdk start redis rails-background-jobs` to bring up Redis and Sidekiq (assuming runit is used for process management) and then running `pstree -atp`:
```
...
│ │ ├─runsv,1471 rails-background-jobs
│ │ │ ├─bundle,14559
│ │ │ │ ├─{base_reliable_*},14627
│ │ │ │ ├─{connection_poo*},14568
│ │ │ │ ├─{connection_poo*},14628
│ │ │ │ ├─{daemon.rb:38},14618
│ │ │ │ ├─{daemon.rb:38},14640
│ │ │ │ ├─{util.rb:23},14792
│ │ │ │ ├─{util.rb:23},14793
│ │ │ │ ├─{util.rb:23},14794
│ │ │ │ ├─{util.rb:23},14795
│ │ │ │ ├─{util.rb:23},14796
...
```
`runsv` is the process supervisor for a service called `rails-background-jobs`. That service runs a GDK script called `sv/rails-background-jobs/run`, which in turn runs `gitlab/bin/background_jobs`, which in turn runs `bundle exec sidekiq`. That's why the process name here is `bundle`, not `sidekiq` (although `ps` will list the process as `sidekiq`; I'm not sure why). The elements underneath that node wrapped in curly braces (`{...}`) are threads. I'm not sure why it creates two connection pools and two daemon threads; something to look into?
Since there is only one process, all jobs from all queues are being processed by all workers executed by that process.
Production / Omnibus
For `*.gitlab.com` and for self-managed installations (both of which are deployed from Omnibus packages), we run Sidekiq in clustered, i.e. multi-process, mode. This spins up several Sidekiq processes per node, each of which then runs its own thread pool. For gitlab.com, we also have multiple instances, i.e. server nodes (Google Compute Engine), each running their own cluster of Sidekiq processes. This is illustrated in the diagram below.
Here we have two GCE instances, `sidekiq-0` and `sidekiq-1`, that are part of the `besteffort` "fleet" (which is, I think, what we call it). In turn, we have 3 queues that are processed by any of these instances. Each instance runs N Sidekiq processes, each of which in turn runs M threads. So in total this setup can process at most `2 * N * M` jobs in queues with `besteffort` priority.
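As a back-of-the-envelope check of that capacity formula, with some illustrative numbers (N and M below are assumptions for the example, not gitlab.com's real configuration):

```ruby
# Fleet capacity = instances * processes per instance * threads per process.
instances   = 2  # sidekiq-0 and sidekiq-1
n_processes = 4  # N: Sidekiq processes per instance (illustrative)
m_threads   = 10 # M: worker threads per process (illustrative)

max_concurrent_jobs = instances * n_processes * m_threads
puts max_concurrent_jobs # => 80
```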
How queues are mapped to priorities and ultimately clusters is well explained in this issue and the related links, but the TL;DR is:
- all Sidekiq clusters are configured with a particular priority configuration (i.e. scaled according to a particular workload, such as "mostly CPU bound" or "mostly I/O bound")
- each queue is then mapped to a priority configuration, either directly, or, if there is no explicit mapping, to `besteffort`

These configurations and mappings are defined in this file.
All workers within a cluster will execute any job from the queues that they read from.
sidekiq-cluster
To implement a multi-process / clustered setup such as for the environments mentioned above, we use a GitLab EE specific script that is part of the Omnibus packages: `sidekiq-cluster`. It's a wrapper around Sidekiq, and its main function is to do the following:
- provide a CLI component that takes a grouped list of queues and other options as arguments
- interpret and map command line options to Sidekiq options and provide reasonable defaults
- for every queue group:
  - spawn `bundle exec sidekiq` with the queues for that group
  - set the concurrency level for that process to `num_queues + 1` or `max_concurrency`, whichever is smaller
  - spawn a supervisor thread that will await the child process
- enter a loop that continuously polls all child processes to check they are alive, and exit if one of them dies
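The supervision loop in the last step can be sketched as follows. This is a simplification, not sidekiq-cluster's real code: plain Ruby threads stand in for the spawned `bundle exec sidekiq` child processes, and the real implementation also signals the surviving children to shut down before exiting.

```ruby
# Two "child processes" (simulated with threads) that run for a while.
children = Array.new(2) do |i|
  Thread.new { sleep 0.1 * (i + 1) }
end

# Continuously poll all children; as soon as one is no longer alive,
# stop supervising (the real cluster would then terminate its siblings
# and exit, so the process supervisor can restart the whole cluster).
loop do
  break unless children.all?(&:alive?)
  sleep 0.01
end

puts "a child died, shutting the cluster down"
```

Failing the whole cluster when one member dies keeps the setup simple: restarting everything is delegated to the outer process supervisor rather than handled per-child.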
For instance, `$ee/bin/sidekiq-cluster a b,c` will spawn 2 Sidekiq processes (say P1 and P2) with the following settings:

P1:
- queues: `a`
- concurrency: 2

P2:
- queues: `b`, `c`
- concurrency: 3
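The P1/P2 settings above follow mechanically from the rules described earlier. Here is a small reimplementation of that derivation (the function name and the `max_concurrency` default are made up for illustration; this is not sidekiq-cluster's actual code):

```ruby
# Each CLI argument is one queue group: comma-separated queue names.
# Concurrency per process: num_queues + 1, capped at max_concurrency.
def process_settings(argv, max_concurrency: 50)
  argv.map do |group|
    queues = group.split(",")
    { queues: queues, concurrency: [queues.size + 1, max_concurrency].min }
  end
end

# "a b,c" yields P1 with queue a / concurrency 2,
# and P2 with queues b, c / concurrency 3.
p process_settings(%w[a b,c])
```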
Metrics
We currently track Sidekiq metrics in Prometheus in 3 different places, only 2 of which are still recommended for use:

- prometheus-app contains `sidekiq_` metrics that are specific to application code executing in Sidekiq workers, most importantly anything we export through our Sidekiq middleware. Anything that measures per-worker metrics, or anything that relies on the process or thread that is executing our jobs, needs to go here.
- prometheus-main contains `sidekiq_` metrics that track general infrastructure and cluster health and that have meaning outside of the application context.
- gitlab-exporter contains a Sidekiq-specific endpoint that Prometheus can scrape for metrics. My understanding is that this exists largely for customers running self-managed environments and should not be used to expose metrics for gitlab.com.
In development mode, all web-app metrics can be fetched from the `/-/metrics` endpoint; however, the Sidekiq middleware will not export metrics by default. You can enable it in `gitlab.yml` by enabling the `sidekiq_exporter`:

```yaml
sidekiq_exporter:
  enabled: true
  address: localhost
  port: 3807
```
Note that the endpoint for Sidekiq is `/metrics`, not `/-/metrics`.