Defining Sidekiq scheduling latency SLOs based on queuing priority
The infrastructure team is currently considering ways to improve the resilience of our Sidekiq infrastructure: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219

In https://gitlab.com/gitlab-org/gitlab-ce/blob/master/config/sidekiq_queues.yml, we assign a priority to each Sidekiq queue in the application. As far as I've been able to tell so far, this prioritisation is not currently used as the basis for the GitLab.com priority queue mechanism, but if it isn't, it should be.

One thing that would really help is setting service-level expectations for each of these priorities. For each priority, the questions are:

1. What is the maximum amount of time a job of this priority should wait in a queue before being processed?
1. What is the maximum amount of time a job of this priority should run for?

With these two metrics we can ensure that the fleet is sufficiently well provisioned.

| Priority | Description | p99 Max Scheduling Time | p99 Max Execution Time |
|----------|-------------|-------------------------|------------------------|
| 1 | low priority | 10 minutes (??) | 3 hours (??) |
| 2 | medium priority | 1 minute (??) | 5 minutes (??) |
| 3 | high priority | 10 seconds (??) | 1 minute (??) |
| 5 | _super_ high priority | 1 second (??) | 20 seconds (??) |

I will also try to do some analysis on GitLab.com to find out what values we are actually seeing at present for queues of each priority.

cc @smcgivern

Related: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30784
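
To make the proposal concrete, here is a minimal sketch of how observed latencies could be checked against the tentative targets in the table. It is not part of any existing tooling: the names `SLO_TARGETS`, `p99` and `check_slo` are hypothetical, the thresholds are the unconfirmed "(??)" figures converted to seconds, and the nearest-rank percentile is just one reasonable way to compute a p99.

```python
# Tentative targets from the table, in seconds:
# priority -> (p99 max scheduling time, p99 max execution time).
# These are the proposed "(??)" values, not agreed SLOs.
SLO_TARGETS = {
    1: (10 * 60.0, 3 * 60 * 60.0),
    2: (60.0, 5 * 60.0),
    3: (10.0, 60.0),
    5: (1.0, 20.0),
}


def p99(samples):
    """Return the nearest-rank 99th percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("p99 needs at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_slo(priority, scheduling_samples, execution_samples):
    """Compare observed p99 values (in seconds) with the targets for a priority."""
    scheduling_target, execution_target = SLO_TARGETS[priority]
    scheduling_p99 = p99(scheduling_samples)
    execution_p99 = p99(execution_samples)
    return {
        "scheduling_p99": scheduling_p99,
        "execution_p99": execution_p99,
        "scheduling_ok": scheduling_p99 <= scheduling_target,
        "execution_ok": execution_p99 <= execution_target,
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    waits = [1.0] * 99 + [30.0]
    runs = [5.0] * 90 + [120.0] * 10
    print(check_slo(3, waits, runs))
```

In practice the samples would come from whatever per-job timing data is available for GitLab.com. The point of the sketch is only that each priority needs two independent checks, one for time spent queued and one for time spent executing.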
issue