Self-managed users experiencing degraded Sidekiq performance due to enabling queue selector and defaulted routing rules
Background
In %15.4, we made routing rules the default in the Rails app, which results in all jobs being routed to the default queue. This reduces the number of queues from 400+ (one queue per worker) to 2 queues (default and mailers) for typical self-managed users.
We wrongly assumed that most users were not using the queue selector, and since Sidekiq's queue_groups default to * anyway (https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/cba33606/files/gitlab-cookbooks/gitlab/attributes/default.rb#L599), the change was deemed safe.
Incident
Since the %15.4 release, the Support team has reported an increase in Zendesk tickets about poor Sidekiq performance (at least 9 tickets).
List of tickets and summaries
-
https://gitlab.zendesk.com/agent/tickets/336502
- Installation type: Docker
-
Sidekiq settings:
    sidekiq['queue_groups'] = [
      # Run all non-CPU-bound queues that are high urgency
      'resource_boundary!=cpu&urgency=high',
      # Run all continuous integration and pages queues that are not high urgency
      'feature_category=continuous_integration,pages&urgency!=high',
      # Run all queues
      '*'
    ]
-
https://gitlab.zendesk.com/agent/tickets/325926
- Installation type: Omnibus
-
Sidekiq settings:
    sidekiq['queue_selector'] = true
    sidekiq['queue_groups'] = [ "*", "*", "*", "*" ]
- The ticket was logged on 14th Sept, before %15.4 was released, so it is unrelated. It was resolved by increasing queue_groups.
-
https://gitlab.zendesk.com/agent/tickets/340214
- Installation type: Docker
-
Sidekiq settings:
    sidekiq['min_concurrency'] = 5
    sidekiq['max_concurrency'] = 15
    sidekiq['queue_selector'] = true
    sidekiq['queue_groups'] = [
      "urgency=high",
      "urgency=high",
      "urgency=low",
      "urgency=low",
      "urgency=low",
      "urgency=throttled",
      "urgency=throttled",
      "*"
    ]
-
https://gitlab.zendesk.com/agent/tickets/337432
- Installation type: Docker
-
Sidekiq settings:
    sidekiq['enable'] = true
    sidekiq['queue_selector'] = true
    sidekiq['queue_groups'] = [ 'urgency=high', '*']
-
https://gitlab.zendesk.com/agent/tickets/341725
- Installation type: Docker
-
Sidekiq settings:
    sidekiq['enable'] = true
    sidekiq['queue_selector'] = true
    sidekiq['queue_groups'] = [
      # Extra resources to support merge requests
      'feature_category=continuous_integration,source_code_management',
      'feature_category=continuous_integration,source_code_management',
      'feature_category=continuous_integration,source_code_management',
      'feature_category=continuous_integration,source_code_management',
      # Run all continuous integration and pages queues that are high urgency
      'feature_category=continuous_integration,pages&urgency=high',
      # Run all integrations queues
      'feature_category=integrations',
      # Run all non-CPU-bound queues that are high urgency
      'resource_boundary!=cpu&urgency=high',
      # Run all queues
      '*'
    ]
-
https://gitlab.zendesk.com/agent/tickets/341178
- Installation type: Docker
- Sidekiq settings: NA (SOS file yet to be provided)
-
https://gitlab.zendesk.com/agent/tickets/342174
- Installation type: Omnibus
-
Sidekiq settings:
    ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 10 --timeout 25 *
Equivalent to
    sidekiq['queue_groups'] = ['*']
- Should not be affected by the routing rules change.
-
https://gitlab.zendesk.com/agent/tickets/337993
- Installation type: Omnibus
-
Sidekiq settings:
    ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 50 --timeout 25 *
Equivalent to
    sidekiq['queue_groups'] = ['*']
- Should not be affected by the routing rules change.
- https://gitlab.zendesk.com/agent/tickets/334091
Note: 15.4 was released on 22nd September 2022.
Some Sidekiq settings were derived from the ticket or the SOS file (if the customer didn't provide gitlab.rb).
Common patterns from the incidents
From 15.4, users who have not set routing_rules but use queue_selector with dedicated processes in queue_groups for certain attributes (e.g. 'resource_boundary!=cpu&urgency=high') experienced slowness in clearing Sidekiq jobs. This happens because the default routing_rules push all jobs to the default queue (as opposed to one queue per worker previously), so Sidekiq processes that are not listening to * sit idle. In other words, only the processes in queue_groups with * are working.
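The pattern above can be illustrated with a small simulation. This is a simplified model and not GitLab's actual implementation: the worker names and their attributes are hypothetical, and the selector matcher only handles the `key=value`, `key!=value`, `&` and `,` forms that appear in the tickets.

```ruby
# Simplified model of queue selectors and routing rules.
# Worker names and attributes are hypothetical.
WORKERS = {
  'pipeline_worker' => { 'feature_category' => 'continuous_integration', 'urgency' => 'high', 'resource_boundary' => 'cpu' },
  'hook_worker'     => { 'feature_category' => 'integrations', 'urgency' => 'high', 'resource_boundary' => 'unknown' },
  'pages_worker'    => { 'feature_category' => 'pages', 'urgency' => 'low', 'resource_boundary' => 'unknown' },
  'cleanup_worker'  => { 'feature_category' => 'source_code_management', 'urgency' => 'low', 'resource_boundary' => 'unknown' }
}.freeze

# True when a worker's attributes satisfy a selector such as
# 'resource_boundary!=cpu&urgency=high'. '*' matches every worker.
def matches?(selector, attrs)
  return true if selector == '*'

  selector.split('&').all? do |term|
    if term.include?('!=')
      key, values = term.split('!=', 2)
      !values.split(',').include?(attrs[key])
    else
      key, values = term.split('=', 2)
      values.split(',').include?(attrs[key])
    end
  end
end

# Queues a process listens to under queue_selector: the per-worker queues
# its selector matches, plus 'default' only for '*' (an assumption of this model).
def listened_queues(selector)
  queues = WORKERS.select { |_, attrs| matches?(selector, attrs) }.keys
  queues << 'default' if selector == '*'
  queues
end

# Queue a job lands in: the first matching rule wins, and a nil target
# means "use the per-worker queue".
def route(worker, routing_rules)
  _, target = routing_rules.find { |selector, _| matches?(selector, WORKERS[worker]) }
  target || worker
end

# Selectors from queue_groups whose processes receive at least one job.
def busy_groups(queue_groups, routing_rules)
  used = WORKERS.keys.map { |worker| route(worker, routing_rules) }
  queue_groups.select { |selector| (listened_queues(selector) & used).any? }
end

groups = [
  'resource_boundary!=cpu&urgency=high',
  'feature_category=continuous_integration,pages&urgency!=high',
  '*'
]

p busy_groups(groups, [['*', 'default']]) # 15.4 default: only the '*' group works
p busy_groups(groups, [['*', nil]])       # per-worker queues: every group works
```

In this model, routing everything to `default` leaves the two attribute-specific groups with nothing to pick up, which matches the idle processes described in the tickets.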
Resolution
For affected customers:
-
Quick fix: advise customers to set
    sidekiq['routing_rules'] = [['*', nil]]
while still maintaining their queue selectors.
-
We can also advise customers to start using custom routing_rules, translating their queue_groups from queue selector form into routing_rules mappings. However, this could put a lot of work on the Support team and would be error-prone, since they would need to come up with queue names and list those names in queue_groups manually without using queue_selector (unless gitlab-org/omnibus-gitlab!6289 (closed) is already merged). We would also need to update our documentation to encourage the use of routing rules instead of queue selector and to explain the relationship between them.
- Example:
  - With current gitlab.rb with queue_selector:
        # sidekiq['routing_rules'] = []
        sidekiq['enable'] = true
        sidekiq['queue_selector'] = true
        sidekiq['queue_groups'] = [
          # Run all non-CPU-bound queues that are high urgency
          'resource_boundary!=cpu&urgency=high',
          # Run all continuous integration and pages queues that are not high urgency
          'feature_category=continuous_integration,pages&urgency!=high',
          # Run all queues
          '*'
        ]
  - Updated gitlab.rb using routing_rules and disabled queue_selector:
        sidekiq['routing_rules'] = [
          # All non-CPU-bound jobs that are high urgency go to 'urgent_other' queue
          ['resource_boundary!=cpu&urgency=high', 'urgent_other'],
          # All continuous integration and pages jobs that are not high urgency go to 'ci_pages' queue
          ['feature_category=continuous_integration,pages&urgency!=high', 'ci_pages'],
          # All other jobs go to 'default' queue
          ['*', 'default']
        ]
        sidekiq['enable'] = true
        sidekiq['queue_selector'] = false
        sidekiq['queue_groups'] = [
          'urgent_other',
          'ci_pages',
          'default'
        ]
    Note: Users can spin up more processes of a certain queue as before, e.g.
        sidekiq['queue_groups'] = ['urgent_other', 'urgent_other', 'ci_pages', 'ci_pages', 'default']
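To reduce the manual effort in the translation described above, the mapping could be generated mechanically. The following is only a sketch: `translate_queue_groups` is a hypothetical helper that does not exist in Omnibus, and the generated queue names (`group_1`, `group_2`, ...) are placeholders that a customer would likely rename to something meaningful such as `urgent_other`.

```ruby
# Hypothetical helper: converts queue_selector-style queue_groups into
# routing_rules plus a queue_groups list of plain queue names.
def translate_queue_groups(queue_groups)
  selectors = queue_groups.uniq

  # One placeholder queue name per distinct selector; '*' maps to 'default'.
  names = {}
  selectors.each_with_index do |selector, index|
    names[selector] = selector == '*' ? 'default' : "group_#{index + 1}"
  end

  # Specific selectors first, in their original order, then the catch-all rule.
  specific = selectors.reject { |selector| selector == '*' }
  routing_rules = specific.map { |selector| [selector, names[selector]] } + [['*', 'default']]

  # Keep one entry per original process so the process count is preserved.
  { routing_rules: routing_rules, queue_groups: queue_groups.map { |selector| names[selector] } }
end

result = translate_queue_groups(['urgency=high', 'urgency=high', 'urgency=low', '*'])
p result[:routing_rules]
p result[:queue_groups]
```

Two caveats apply to this sketch. With routing rules only the first matching rule applies, so overlapping selectors behave differently than they did under queue_selector, where several groups could listen to the same queue. Also, if the original queue_groups had no `*` entry, nothing in the translated queue_groups listens to `default`, and a listener would have to be added by hand.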
For customers yet to upgrade to 15.4:
1. 15.6 - Revert default routing rules on rails app - gitlab-org/gitlab!103483 (merged)
2. [ ] 15.6 - Add logic in Omnibus and charts that sets routing_rules to [['*', nil]] if routing_rules is not already set and queue_groups consists only of *s (there can be multiple). We don't want to interfere with the default routing_rules when users are changing queue_groups.
   - [ ] gitlab-org/omnibus-gitlab!6511 (closed)
   - [ ] TODO: Charts MR
3. Backport the revert in (1) to 15.4 and 15.5 - https://gitlab.com/gitlab-org/release/tasks/-/issues/4522
4. [ ] 15.7 - Reintroduce default routing rules on rails app
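The condition in step 2 can be sketched as a small function. This is an illustration of the rule as worded in the plan, not the code in gitlab-org/omnibus-gitlab!6511. The function name is hypothetical, and treating both nil and an empty array as "not set" is an assumption.

```ruby
# Rule that routes every job to its per-worker queue.
PER_WORKER_ROUTING = [['*', nil]].freeze

# Hypothetical sketch of the step 2 condition: override routing_rules only
# when the user has not set them and every queue_groups entry is '*'.
def effective_routing_rules(routing_rules, queue_groups)
  already_set = !(routing_rules.nil? || routing_rules.empty?)
  return routing_rules if already_set

  only_stars = !queue_groups.empty? && queue_groups.all? { |group| group == '*' }
  only_stars ? PER_WORKER_ROUTING : routing_rules
end

p effective_routing_rules(nil, ['*', '*'])           # override applies
p effective_routing_rules([], ['urgency=high', '*']) # left untouched
p effective_routing_rules([['*', 'default']], ['*']) # user setting wins
```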