Proposal to simplify sidekiq worker pools
Requires https://gitlab.com/gitlab-org/gitlab-ce/issues/64692
Spawned from https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7177
Currently we have a number of different sidekiq priority queues.

It's unclear to me what the key differentiator between the different queues is. I had assumed it was based on throughput - for example, `realtime` for high-priority, short jobs, and `besteffort` for low-priority, long-running jobs - but this doesn't appear to be the case: some tasks which take upwards of 2.5 hours run on the `realtime` queues.
Once a job is assigned to a priority queue, it will be processed by a fleet of sidekiq workers dedicated to that queue. For example, we have sidekiq fleets for `realtime`, `besteffort`, etc.
If we look at things on a machine level, each node is running a set of sidekiq worker processes and each worker has a set of threads handling jobs.
At this point there are some more surprises:
- Each process has a different number of worker threads (between 3 and 12 per process)
- Each process will handle a different set of jobs from the queue
```
git 8693 2523 0 12:37 ? 00:00:00 ruby /opt/gitlab/embedded/service/gitlab-rails/ee/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails post_receive,merge,update_merge_requests,gitlab_shell,email_receiver,repository_fork,reactive_caching,project_update_repository_storage,ldap_group_sync,new_issue,new_merge_request update_merge_requests,post_receive process_commit,process_commit,process_commit process_commit,process_commit,process_commit authorized_projects,authorized_projects new_note,new_note merge,merge,update_merge_requests merge,merge,update_merge_requests update_merge_requests,post_receive
git 8700 8693 55 12:37 ? 00:13:32 sidekiq 5.2.7 queues: post_receive, merge, update_merge_requests, gitlab_shell, email_receiver, repository_fork, reactive_caching, project_update_repository_storage, ldap_group_sync, new_issue, new_merge_request [3 of 12 busy]
git 8702 8693 24 12:37 ? 00:05:55 sidekiq 5.2.7 queues: update_merge_requests, post_receive [0 of 3 busy]
git 8704 8693 9 12:37 ? 00:02:18 sidekiq 5.2.7 queues: process_commit (3) [0 of 4 busy]
git 8706 8693 9 12:37 ? 00:02:21 sidekiq 5.2.7 queues: process_commit (3) [0 of 4 busy]
git 8708 8693 7 12:37 ? 00:01:45 sidekiq 5.2.7 queues: authorized_projects (2) [0 of 3 busy]
git 8710 8693 8 12:37 ? 00:02:03 sidekiq 5.2.7 queues: new_note (2) [0 of 3 busy]
git 8712 8693 13 12:37 ? 00:03:12 sidekiq 5.2.7 queues: merge (2), update_merge_requests [1 of 4 busy]
git 8714 8693 13 12:37 ? 00:03:13 sidekiq 5.2.7 queues: merge (2), update_merge_requests [0 of 4 busy]
git 8716 8693 29 12:37 ? 00:07:13 sidekiq 5.2.7 queues: update_merge_requests, post_receive [1 of 3 busy]
```
This means that some job queues could be saturated while other worker processes in the same fleet sit idle. It also means that we need to monitor the fleet and make constant manual adjustments.

Unfortunately, as far as I can tell, we don't have metrics to alert us when all the workers for a certain subset of the fleet are busy. Instead, we respond reactively when queue lengths start climbing.
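In the meantime, the `[N of M busy]` field that sidekiq writes into each process title (visible in the `ps` output above) is enough to compute fleet-wide thread saturation by hand. A minimal Ruby sketch, using illustrative sample proctitles rather than live data:

```ruby
# Compute fleet-wide thread saturation from the "[N of M busy]" field
# sidekiq reports in its process title. Sample data is illustrative.
proctitles = [
  "sidekiq 5.2.7 queues: post_receive, merge [3 of 12 busy]",
  "sidekiq 5.2.7 queues: update_merge_requests, post_receive [0 of 3 busy]",
  "sidekiq 5.2.7 queues: merge (2), update_merge_requests [1 of 4 busy]",
]

# Sum busy and total thread counts across every process in the fleet.
busy, total = proctitles.reduce([0, 0]) do |(b, t), title|
  n, m = title.match(/\[(\d+) of (\d+) busy\]/).captures.map(&:to_i)
  [b + n, t + m]
end

saturation = busy.to_f / total
puts format("%d/%d threads busy (%.0f%% saturated)", busy, total, saturation * 100)
# → 4/19 threads busy (21% saturated)
```

The same calculation, exported as a metric, would give us the saturation signal we currently lack.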
## Proposal
I propose a simpler approach, which should be easier to manage.
- Priority queues are strictly based on throughput requirements and job latency.
- Each priority queue has strict SLO requirements for latency. If the apdex for a particular job consistently does not meet the required SLO, development teams will be notified and the job will be de-prioritised to a high-latency queue.
- Each priority queue will have its own fleet (same as present)
- Each worker process will process all jobs for a given priority queue, not a subset
- Each worker will have the same number of threads
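Under this model, every process in a fleet could share one trivial configuration. A sketch of what that might look like as a standard Sidekiq YAML config - the concurrency value and queue name are illustrative assumptions, not tuned recommendations:

```yaml
# Hypothetical per-fleet Sidekiq config: identical for every process in the fleet.
:concurrency: 4      # same thread count for every worker process
:queues:
  - realtime         # every process pulls from the full queue set for its priority
```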
This approach will be easier to manage and will not require manual adjustment. If the `realtime` queue is not keeping up with jobs, its fleet can be scaled up to process more. If saturation of worker threads across a fleet drops below a threshold for a certain period, the fleet can be scaled back.
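The scale-down condition could be encoded as a Prometheus alerting rule. A sketch - the metric names `sidekiq_busy_threads` and `sidekiq_total_threads` are placeholders for whatever we actually export, and the 25%/30m thresholds are illustrative:

```yaml
groups:
  - name: sidekiq-fleet-saturation
    rules:
      - alert: SidekiqFleetUnderutilised
        # Placeholder metric names; threshold and duration are illustrative.
        expr: >
          sum(sidekiq_busy_threads{fleet="realtime"})
            /
          sum(sidekiq_total_threads{fleet="realtime"})
          < 0.25
        for: 30m
        annotations:
          summary: Realtime fleet saturation below 25% for 30m; consider scaling down.
```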
This will also be much simpler to deal with in a k8s world (@skarbek what strategy are we using here?)