Defining Sidekiq scheduling latency SLOs based on queuing priority

The infrastructure team is currently considering ways to improve the resilience of our Sidekiq infrastructure: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219

In https://gitlab.com/gitlab-org/gitlab-ce/blob/master/config/sidekiq_queues.yml, we assign a priority to each Sidekiq queue in the application.

As far as I've been able to tell so far, this prioritisation is not currently used as the basis for the GitLab.com priority queue mechanism. If that is the case, it should be.

One thing that would really help is if we set service-level expectations for each of these priorities.

For each priority, the expectations should answer two questions:

  1. What is the maximum amount of time a job of this priority should wait in a queue before being processed?
  2. What is the maximum amount of time a job of this priority should run for?

With targets for these two metrics, we can check that the fleet is sufficiently well provisioned.

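| Priority | Description | p99 Max Scheduling Time | p99 Max Execution Time |
|----------|-------------|-------------------------|------------------------|
| 1 | low priority | 10 minutes (??) | 3 hours (??) |
| 2 | medium priority | 1 minute (??) | 5 minutes (??) |
| 3 | high priority | 10 seconds (??) | 1 minute (??) |
| 5 | super high priority | 1 second (??) | 20 seconds (??) |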
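
To make the proposal concrete, here is a minimal sketch of how observed timings could be checked against these targets. It is illustrative only: the names `PROPOSED_SLOS`, `p99`, and `check_priority` are hypothetical, the thresholds are the tentative "(??)" values from the table converted to seconds, and the nearest-rank percentile is just one reasonable way to compute a p99.

```python
# Tentative targets from the table, in seconds:
# priority -> (p99 max scheduling time, p99 max execution time).
# These are the unconfirmed "(??)" values, not agreed SLOs.
PROPOSED_SLOS = {
    1: (10 * 60, 3 * 60 * 60),  # low priority
    2: (60, 5 * 60),            # medium priority
    3: (10, 60),                # high priority
    5: (1, 20),                 # super high priority
}


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_priority(priority, scheduling_secs, execution_secs):
    """Compare observed p99 timings for one priority with its targets."""
    max_sched, max_exec = PROPOSED_SLOS[priority]
    return {
        "scheduling_ok": p99(scheduling_secs) <= max_sched,
        "execution_ok": p99(execution_secs) <= max_exec,
    }
```

Running `check_priority` once per priority over a sample window would show which priorities already meet the proposed targets and which would need either more capacity or a looser target.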

I will also try to do some analysis on GitLab.com to find out what values we are actually seeing at present for queues of each priority.

cc @smcgivern

Related: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30784