Document canonical Sidekiq routing rules for reference architectures
A customer ran into high Redis CPU utilization when upgrading from GitLab v16.3 to v16.6. I suspect the increased number of queues (45) may be causing this CPU saturation. I count 676 queues being watched in the latest nightly. Two years ago, when @cmiskell wrote https://about.gitlab.com/blog/2021/09/02/specialized-sidekiq-configuration-lessons-from-gitlab-dot-com/, we "only" had 440 queues.
In https://docs.gitlab.com/ee/administration/sidekiq/extra_sidekiq_processes.html, we document a naive setup:
```ruby
sidekiq['queue_groups'] = ['*'] * 4
```
However, even on a single node this is particularly hard on Redis, because each of the 4 processes runs 50 threads and issues `BRPOP` commands spanning over 650 queues!
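One partial mitigation, sketched here against the Omnibus `gitlab.rb` settings used in those docs (the concurrency values are illustrative, not a recommendation), is to cap per-process concurrency so fewer threads poll Redis at once:

```ruby
# Still listens to every queue, but 4 processes x 10 threads = 40 pollers
# instead of the default 4 x 50 = 200.
sidekiq['queue_groups'] = ['*'] * 4
sidekiq['min_concurrency'] = 10
sidekiq['max_concurrency'] = 10
```

That only reduces the number of concurrent `BRPOP` calls, though; each call still enumerates all ~676 queues, which is why routing rules are the real fix.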
In https://docs.gitlab.com/ee/administration/sidekiq/processing_specific_job_classes.html#migrating-from-queue-selectors-to-routing-rules, we provide a sample configuration for routing rules:
```ruby
sidekiq['min_concurrency'] = 20
sidekiq['max_concurrency'] = 20
sidekiq['routing_rules'] = [
  ['urgency=high', 'high_urgency'],
  ['urgency=low', 'low_urgency'],
  ['urgency=throttled', 'throttled_urgency'],
  # Wildcard matching, route the rest to `default` queue
  ['*', 'default']
]
sidekiq['queue_selector'] = false
sidekiq['queue_groups'] = [
  'high_urgency',
  'low_urgency',
  'throttled_urgency',
  'default,mailers'
]
```
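With this layout, each entry in `queue_groups` starts one `sidekiq-cluster` process that polls only its named queues (the comma-joined `default,mailers` entry gives one process both queues), so every `BRPOP` spans one or two queues instead of several hundred.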
Whereas on GitLab.com, we have these settings in https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/df779737f3af84a92f14a2ec1cd32a810315dd34/releases/gitlab/values/gprd.yaml.gotmpl#L959-969:
- ["worker_name=AuthorizedProjectUpdate::UserRefreshFromReplicaWorker,AuthorizedProjectUpdate::UserRefreshWithLowUrgencyWorker", "quarantine"] # move this to the quarantine shard
- ["worker_name=AuthorizedProjectsWorker", "urgent_authorized_projects"] # urgent-authorized-projects
- ["resource_boundary=cpu&urgency=high", "urgent_cpu_bound"] # urgent-cpu-bound
- ["resource_boundary=memory", "memory_bound"] # memory-bound
- ["feature_category=global_search&urgency=throttled", "elasticsearch"] # elasticsearch
- ["resource_boundary!=cpu&urgency=high", "urgent_other"] # urgent-other
- ["resource_boundary=cpu&urgency=default,low", "low_urgency_cpu_bound"] # low-urgency-cpu-bound
- ["feature_category=database&urgency=throttled", "database_throttled"] # database-throttled
- ["feature_category=gitaly&urgency=throttled", "gitaly_throttled"] # gitaly-throttled
- ["*", "default"] # catchall on k8s
Yet another customer in https://gitlab.com/gitlab-org/distribution/team-tasks/-/issues/1422#note_1690949576 has an even more surprising config: only two queues, one for search and another for everything else!
```ruby
sidekiq['max_concurrency'] = '25'
sidekiq['routing_rules'] = [
  ["feature_category=global_search", "global_search"],
  ['*', 'default'],
]
sidekiq['queue_groups'] = [
  'global_search',
  'global_search',
  'global_search',
  'global_search',
  'global_search'
]
```
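If I'm reading that correctly, repeating `global_search` five times starts five processes that each poll only that single queue, and nothing in this `queue_groups` list polls `default` at all, so presumably other Sidekiq nodes pick up the catch-all queue.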
I think we need to make it clearer that the naive setup of listening to all queues is no longer recommended, but we should also document a canonical example that will work for most installations.
The docs in https://docs.gitlab.com/ee/administration/sidekiq/processing_specific_job_classes.html#migrating-from-queue-selectors-to-routing-rules seem to be a good starting point, but I wonder if we should take some of the learnings from GitLab.com and tune this?
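As a straw-man only (my sketch, not an agreed recommendation), a middle ground might keep the urgency-based split from the docs and borrow GitLab.com's resource-boundary shards for CPU- and memory-bound work:

```ruby
# Straw-man gitlab.rb config; shard names are illustrative.
sidekiq['min_concurrency'] = 20
sidekiq['max_concurrency'] = 20
sidekiq['routing_rules'] = [
  ['resource_boundary=cpu&urgency=high', 'urgent_cpu_bound'], # from GitLab.com
  ['resource_boundary=memory', 'memory_bound'],               # from GitLab.com
  ['urgency=high', 'high_urgency'],
  ['urgency=throttled', 'throttled_urgency'],
  ['*', 'default']                                            # catch-all, incl. urgency=low
]
sidekiq['queue_groups'] = [
  'urgent_cpu_bound',
  'memory_bound',
  'high_urgency',
  'throttled_urgency',
  'default,mailers'
]
```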
@cmiskell, @engwan, @qmnguyen0711, @grantyoung What do you think?