Defining Sidekiq scheduling latency SLOs based on queuing priority
The infrastructure team is currently considering ways to improve the resilience of our Sidekiq infrastructure: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219
In https://gitlab.com/gitlab-org/gitlab-ce/blob/master/config/sidekiq_queues.yml, we assign a priority to each Sidekiq queue in the application.
As far as I can tell so far, this prioritisation is not currently used as the basis for the GitLab.com priority queue mechanism, but if it isn't, it should be.
One thing that would really help is if we set service-level expectations for each of these priorities.
For each priority, the expectations should answer two questions:
- What is the maximum amount of time a job of this priority should wait in a queue before being processed?
- What is the maximum amount of time a job of this priority should run for?
With these two metrics we can ensure that the fleet is sufficiently well provisioned.
Priority | Description | p99 Max Scheduling Time | p99 Max Execution Time |
---|---|---|---|
1 | low priority | 10 minutes (??) | 3 hours (??) |
2 | medium priority | 1 minute (??) | 5 minutes (??) |
3 | high priority | 10 seconds (??) | 1 minute (??) |
5 | super high priority | 1 second (??) | 20 seconds (??) |
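
To make the proposal concrete, here is a rough sketch of how observed latencies could be checked against the table above. It is only an illustration: the thresholds are the draft values from the table (all still marked `??`), and the names `DRAFT_SLOS`, `p99` and `check_priority` are hypothetical, not part of GitLab or Sidekiq. It assumes we can collect per-job scheduling and execution durations in seconds for each priority.

```python
# Draft targets copied from the table above (all still marked "??" there).
# priority -> (p99 max scheduling seconds, p99 max execution seconds)
DRAFT_SLOS = {
    1: (600.0, 10800.0),  # low priority: 10 minutes, 3 hours
    2: (60.0, 300.0),     # medium priority: 1 minute, 5 minutes
    3: (10.0, 60.0),      # high priority: 10 seconds, 1 minute
    5: (1.0, 20.0),       # super high priority: 1 second, 20 seconds
}


def p99(samples):
    """Return the 99th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_priority(priority, scheduling_secs, execution_secs):
    """Compare observed p99 latencies for one priority with its draft targets."""
    sched_limit, exec_limit = DRAFT_SLOS[priority]
    sched_p99 = p99(scheduling_secs)
    exec_p99 = p99(execution_secs)
    return {
        "scheduling_p99": sched_p99,
        "scheduling_ok": sched_p99 <= sched_limit,
        "execution_p99": exec_p99,
        "execution_ok": exec_p99 <= exec_limit,
    }
```

For example, `check_priority(3, scheduling_secs, execution_secs)` would report whether high priority jobs are currently within the draft 10 second / 1 minute targets. In practice the percentiles would more likely come from our existing monitoring than from raw samples, so treat this purely as a statement of what "meeting the SLO" would mean.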
I will also try to do some analysis on GitLab.com to find out what values we are actually seeing at present for queues of each priority.
cc @smcgivern
Related: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30784