
Proposal to simplify Sidekiq worker pools

Requires https://gitlab.com/gitlab-org/gitlab-ce/issues/64692


Spawned from https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7177

Currently we have a number of different Sidekiq priority queues.

It's unclear to me what the key differentiator between the different queues is. I had assumed it was based on throughput (for example, realtime for high-priority, short jobs and besteffort for low-priority, long-running jobs), but this doesn't appear to be the case: some jobs that take upwards of 2.5 hours run on the realtime queues.

Once a job is assigned to a priority queue, it will be processed by a fleet of Sidekiq workers dedicated to that queue. For example, we have Sidekiq fleets for realtime, besteffort, etc.

If we look at things at the machine level, each node runs a set of Sidekiq worker processes, and each worker process has a set of threads handling jobs.

At this point there are some more surprises:

  1. Each process has a different number of worker threads (between 3 and 12 per process)
  2. Each process handles a different subset of the queues, as the process listing below shows:
```
git       8693  2523  0 12:37 ?        00:00:00       ruby /opt/gitlab/embedded/service/gitlab-rails/ee/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails post_receive,merge,update_merge_requests,gitlab_shell,email_receiver,repository_fork,reactive_caching,project_update_repository_storage,ldap_group_sync,new_issue,new_merge_request update_merge_requests,post_receive process_commit,process_commit,process_commit process_commit,process_commit,process_commit authorized_projects,authorized_projects new_note,new_note merge,merge,update_merge_requests merge,merge,update_merge_requests update_merge_requests,post_receive
git       8700  8693 55 12:37 ?        00:13:32         sidekiq 5.2.7 queues: post_receive, merge, update_merge_requests, gitlab_shell, email_receiver, repository_fork, reactive_caching, project_update_repository_storage, ldap_group_sync, new_issue, new_merge_request [3 of 12 busy]
git       8702  8693 24 12:37 ?        00:05:55         sidekiq 5.2.7 queues: update_merge_requests, post_receive [0 of 3 busy]
git       8704  8693  9 12:37 ?        00:02:18         sidekiq 5.2.7 queues: process_commit (3) [0 of 4 busy]
git       8706  8693  9 12:37 ?        00:02:21         sidekiq 5.2.7 queues: process_commit (3) [0 of 4 busy]
git       8708  8693  7 12:37 ?        00:01:45         sidekiq 5.2.7 queues: authorized_projects (2) [0 of 3 busy]
git       8710  8693  8 12:37 ?        00:02:03         sidekiq 5.2.7 queues: new_note (2) [0 of 3 busy]
git       8712  8693 13 12:37 ?        00:03:12         sidekiq 5.2.7 queues: merge (2), update_merge_requests [1 of 4 busy]
git       8714  8693 13 12:37 ?        00:03:13         sidekiq 5.2.7 queues: merge (2), update_merge_requests [0 of 4 busy]
git       8716  8693 29 12:37 ?        00:07:13         sidekiq 5.2.7 queues: update_merge_requests, post_receive [1 of 3 busy]
```

This means that the worker threads dedicated to some queues can be fully saturated while other worker processes in the same fleet sit idle.
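For illustration, here is a minimal sketch (Ruby, using the standard Sidekiq API; the output format is purely illustrative) that lists busy versus total threads for every worker process, which is enough to make this kind of imbalance visible:

```ruby
# Minimal sketch: show how unevenly loaded the worker processes are,
# using the data Sidekiq already publishes about each process.
require 'sidekiq/api'

Sidekiq::ProcessSet.new.each do |process|
  puts format(
    '%s pid=%s [%d of %d busy] queues=%s',
    process['hostname'],
    process['pid'],
    process['busy'],        # threads currently executing a job
    process['concurrency'], # total threads in this process
    process['queues'].join(',')
  )
end
```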

This setup also means that we need to manually monitor the fleet and make constant manual adjustments.

Unfortunately, as far as I can tell, we don't have metrics to alert us when all the workers for a certain subset of the fleet are busy.

Instead, we respond reactively when queue lengths start climbing.
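As a sketch of the signal such an alert could be built on (assuming we keep the current per-process queue assignments; this is not an existing metric), the check below groups worker processes by the set of queues they listen to and flags any group whose threads are all busy:

```ruby
# Hypothetical saturation check: group worker processes by the queues they
# serve and flag any group with no idle threads left.
require 'sidekiq/api'

groups = Hash.new { |h, k| h[k] = { busy: 0, total: 0 } }

Sidekiq::ProcessSet.new.each do |process|
  key = process['queues'].sort.join(',')
  groups[key][:busy]  += process['busy']
  groups[key][:total] += process['concurrency']
end

groups.each do |queues, counts|
  next unless counts[:busy] >= counts[:total]

  puts "SATURATED: all #{counts[:total]} threads busy for queues: #{queues}"
end
```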


Proposal

I propose a simpler approach, which should be easier to manage.

  1. Priority queues are strictly based on throughput requirements and job latency.
  2. Each priority queue has strict SLO requirements for latency. If the apdex for a particular job consistently fails to meet the required SLO, development teams will be notified and the job will be de-prioritised to a high-latency queue (see the sketch after this list).
  3. Each priority queue will have its own fleet (same as at present).
  4. Each worker process will process all jobs for a given priority queue, not a subset.
  5. Each worker process will have the same number of threads.
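As a rough sketch of point 2, assuming queue latency (how long the oldest job has been waiting) is used as a stand-in for the apdex measurement, and with queue names and SLO thresholds that are purely illustrative:

```ruby
# Hypothetical per-priority-queue latency SLO check. The real apdex would
# come from job-level duration metrics; queue latency is a simple proxy.
require 'sidekiq/api'

LATENCY_SLO_SECONDS = {
  'realtime'   => 10,
  'besteffort' => 600
}.freeze

LATENCY_SLO_SECONDS.each do |queue_name, slo|
  latency = Sidekiq::Queue.new(queue_name).latency
  next if latency <= slo

  puts "#{queue_name}: latency #{latency.round(1)}s exceeds the SLO of #{slo}s; " \
       'notify the owning team and consider de-prioritising the offending jobs'
end
```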

This approach will be easier to manage and will not require manual adjustment. If the realtime queue is not keeping up with jobs, its fleet can be scaled up to process more. If saturation of worker threads across a fleet drops below a threshold for a sustained period, the fleet can be scaled back.
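A minimal sketch of that scaling signal, again via the Sidekiq API; the queue name and both thresholds are placeholders rather than agreed values:

```ruby
# Sketch of a scale-up/scale-down signal for a single priority-queue fleet.
require 'sidekiq/api'

QUEUE_NAME           = 'realtime'
SCALE_UP_LATENCY     = 10    # seconds of queue latency before adding capacity
SCALE_DOWN_THRESHOLD = 0.3   # fleet-wide thread saturation below which to shrink

busy = 0
total = 0
Sidekiq::ProcessSet.new.each do |process|
  next unless process['queues'].include?(QUEUE_NAME)

  busy  += process['busy']
  total += process['concurrency']
end

saturation = total.zero? ? 0.0 : busy.to_f / total
latency    = Sidekiq::Queue.new(QUEUE_NAME).latency

if latency > SCALE_UP_LATENCY
  puts "scale up: #{QUEUE_NAME} queue latency is #{latency.round(1)}s"
elsif saturation < SCALE_DOWN_THRESHOLD
  puts "scale down: fleet saturation is #{(saturation * 100).round}%"
end
```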

The proposed approach will also be much simpler to deal with in a k8s world (@skarbek, what strategy are we using here?)
