Self-managed users experiencing degraded Sidekiq performance due to enabling queue selector and defaulted routing rules
Background
In %15.4, we set default routing rules in the Rails app, which results in all jobs being routed to the default queue. This reduces the number of queues from 400+ (one queue per worker) to 2 queues (default and mailers) for typical self-managed users.
We wrongly assumed that most users were not using the queue selector, and since Sidekiq's queue_groups defaults to * anyway (https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/cba33606/files/gitlab-cookbooks/gitlab/attributes/default.rb#L599), the change was deemed safe.
Incident
Since the %15.4 release, the Support team has reported an increase in Zendesk tickets about poor Sidekiq performance (at least 9 tickets).
List of tickets and summaries
- https://gitlab.zendesk.com/agent/tickets/336502
  - Installation type: Docker
  - Sidekiq settings:

        sidekiq['queue_groups'] = [
          # Run all non-CPU-bound queues that are high urgency
          'resource_boundary!=cpu&urgency=high',
          # Run all continuous integration and pages queues that are not high urgency
          'feature_category=continuous_integration,pages&urgency!=high',
          # Run all queues
          '*'
        ]
- https://gitlab.zendesk.com/agent/tickets/325926
  - Installation type: Omnibus
  - Sidekiq settings:

        sidekiq['queue_selector'] = true
        sidekiq['queue_groups'] = [ "*", "*", "*", "*" ]

  - The ticket was logged on 14th Sept, before %15.4 was released, so it is unrelated. It was resolved by increasing queue_groups.
- https://gitlab.zendesk.com/agent/tickets/340214
  - Installation type: Docker
  - Sidekiq settings:

        sidekiq['min_concurrency'] = 5
        sidekiq['max_concurrency'] = 15
        sidekiq['queue_selector'] = true
        sidekiq['queue_groups'] = [
          "urgency=high",
          "urgency=high",
          "urgency=low",
          "urgency=low",
          "urgency=low",
          "urgency=throttled",
          "urgency=throttled",
          "*"
        ]
- https://gitlab.zendesk.com/agent/tickets/337432
  - Installation type: Docker
  - Sidekiq settings:

        sidekiq['enable'] = true
        sidekiq['queue_selector'] = true
        sidekiq['queue_groups'] = [ 'urgency=high', '*' ]
- https://gitlab.zendesk.com/agent/tickets/341725
  - Installation type: Docker
  - Sidekiq settings:

        sidekiq['enable'] = true
        sidekiq['queue_selector'] = true
        sidekiq['queue_groups'] = [
          # Extra resources to support merge requests
          'feature_category=continuous_integration,source_code_management',
          'feature_category=continuous_integration,source_code_management',
          'feature_category=continuous_integration,source_code_management',
          'feature_category=continuous_integration,source_code_management',
          # Run all continuous integration and pages queues that are high urgency
          'feature_category=continuous_integration,pages&urgency=high',
          # Run all integrations queues
          'feature_category=integrations',
          # Run all non-CPU-bound queues that are high urgency
          'resource_boundary!=cpu&urgency=high',
          # Run all queues
          '*'
        ]
- https://gitlab.zendesk.com/agent/tickets/341178
  - Installation type: Docker
  - Sidekiq settings: NA (SOS file yet to be provided)
- https://gitlab.zendesk.com/agent/tickets/342174
  - Installation type: Omnibus
  - Sidekiq settings:

        ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 10 --timeout 25 *
        sidekiq['queue_groups'] = ['*']

  - Shouldn't be affected by the routing rules change.
- https://gitlab.zendesk.com/agent/tickets/337993
  - Installation type: Omnibus
  - Sidekiq settings:

        ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 50 --timeout 25 *
        sidekiq['queue_groups'] = ['*']

  - Shouldn't be affected by the routing rules change.
- https://gitlab.zendesk.com/agent/tickets/334091
Note: 15.4 was released on 22nd September 2022.
Some Sidekiq settings were derived from the ticket or SOS file (if the customer didn't provide gitlab.rb).
Common patterns from the incidents
From 15.4, users who have no routing_rules set, use queue_selector, and dedicate processes in queue_groups to certain attributes (e.g. 'resource_boundary!=cpu&urgency=high') experienced slowness in clearing Sidekiq jobs. This happens because the default routing_rules push all jobs to the default queue (as opposed to a single queue per worker previously), so Sidekiq processes that are not listening to * sit idle. In other words, only the processes in queue_groups with * are doing any work.
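The pattern above can be sketched with a small, simplified model. This is not GitLab's actual implementation: the worker names, the `selector_match?`, `route`, `queues_for_group` and `busy_groups` helpers, and the assumption that a selector group listens only to the per-worker queues of matching workers (while `*` also covers default and mailers) are all illustrative. The 15.4 default is assumed here to behave like `[['*', 'default']]`.

```ruby
# Simplified model for illustration only; not GitLab's real routing code.
Worker = Struct.new(:name, :urgency, keyword_init: true)

# Hypothetical workers.
WORKERS = [
  Worker.new(name: 'post_receive', urgency: 'high'),
  Worker.new(name: 'pipeline_process', urgency: 'high'),
  Worker.new(name: 'repository_import', urgency: 'low')
].freeze

# Supports only '*' and 'urgency=<value>' selectors in this sketch.
def selector_match?(selector, worker)
  return true if selector == '*'

  key, value = selector.split('=', 2)
  worker[key.to_sym] == value
end

# First matching rule wins; a nil target means "use the per-worker queue".
def route(worker, routing_rules)
  _, target = routing_rules.find { |selector, _| selector_match?(selector, worker) }
  target || worker.name
end

# Queues a queue_groups entry listens to when queue_selector is enabled
# (assumption: per-worker queue names, plus default/mailers for '*').
def queues_for_group(selector, workers)
  queues = workers.select { |w| selector_match?(selector, w) }.map(&:name)
  selector == '*' ? queues + %w[default mailers] : queues
end

# Groups that listen to at least one queue that actually receives jobs.
def busy_groups(queue_groups, routing_rules, workers)
  job_queues = workers.map { |w| route(w, routing_rules) }.uniq
  queue_groups.select { |g| (queues_for_group(g, workers) & job_queues).any? }
end

queue_groups = ['urgency=high', '*']

with_default_rules = busy_groups(queue_groups, [['*', 'default']], WORKERS)
# => ['*']  (the 'urgency=high' process is idle)

with_per_worker_rules = busy_groups(queue_groups, [['*', nil]], WORKERS)
# => ['urgency=high', '*']
```

In this model, routing everything to default leaves the dedicated `urgency=high` process with nothing to pick up, which matches the behaviour reported in the tickets.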
Resolution
For affected customers:
- The quick fix is to advise customers to set sidekiq['routing_rules'] = [['*', nil]] while keeping their queue selectors.
- We can also advise customers to start using custom routing_rules, translating their queue_groups from queue selectors into routing_rules mappings. However, this could put a lot of work on the Support team and would be error-prone, since they would need to come up with queue names and list those names in queue_groups manually without using queue_selector (unless gitlab-org/omnibus-gitlab!6289 (closed) is already merged). We would need to update our documentation to encourage the use of routing rules instead of queue selector and explain the relationship between them.
  - Example:
    - With current gitlab.rb with queue_selector:

          # sidekiq['routing_rules'] = []
          sidekiq['enable'] = true
          sidekiq['queue_selector'] = true
          sidekiq['queue_groups'] = [
            # Run all non-CPU-bound queues that are high urgency
            'resource_boundary!=cpu&urgency=high',
            # Run all continuous integration and pages queues that are not high urgency
            'feature_category=continuous_integration,pages&urgency!=high',
            # Run all queues
            '*'
          ]
    - Updated gitlab.rb using routing_rules and disabled queue_selector:

          sidekiq['routing_rules'] = [
            # All non-CPU-bound jobs that are high urgency go to 'urgent_other' queue
            ['resource_boundary!=cpu&urgency=high', 'urgent_other'],
            # All continuous integration and pages jobs that are not high urgency go to 'ci_pages' queue
            ['feature_category=continuous_integration,pages&urgency!=high', 'ci_pages'],
            # All other jobs go to 'default' queue
            ['*', 'default']
          ]
          sidekiq['enable'] = true
          sidekiq['queue_selector'] = false
          sidekiq['queue_groups'] = [
            'urgent_other',
            'ci_pages',
            'default'
          ]

          # Variant with more than one process for some queues:
          sidekiq['queue_groups'] = ['urgent_other', 'urgent_other', 'ci_pages', 'ci_pages', 'default']
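Since the manual translation is described as error-prone, a small helper could do the mechanical part. The following is only a sketch: `translate_queue_groups` is a hypothetical function, not part of GitLab or Omnibus, and the queue names still have to be chosen by whoever runs it. It assumes `*` maps to the default queue, as in the example above.

```ruby
# Hypothetical helper, not part of GitLab or Omnibus.
# selectors:   the existing queue_groups entries (queue selector syntax)
# queue_names: operator-chosen queue name for each non-'*' selector
def translate_queue_groups(selectors, queue_names)
  unique = selectors.uniq
  missing = unique.reject { |s| s == '*' || queue_names.key?(s) }
  raise ArgumentError, "no queue name for: #{missing.join(', ')}" unless missing.empty?

  name_for = ->(selector) { selector == '*' ? 'default' : queue_names.fetch(selector) }

  # Specific selectors first and the catch-all last, because the first
  # matching routing rule wins.
  specific = unique.reject { |s| s == '*' }
  routing_rules = specific.map { |s| [s, name_for.call(s)] } + [['*', 'default']]

  # One queue_groups entry per original entry, so process counts are kept.
  queue_groups = selectors.map { |s| name_for.call(s) }

  { routing_rules: routing_rules, queue_groups: queue_groups }
end

result = translate_queue_groups(
  [
    'resource_boundary!=cpu&urgency=high',
    'feature_category=continuous_integration,pages&urgency!=high',
    '*'
  ],
  {
    'resource_boundary!=cpu&urgency=high' => 'urgent_other',
    'feature_category=continuous_integration,pages&urgency!=high' => 'ci_pages'
  }
)
# result[:queue_groups] => ['urgent_other', 'ci_pages', 'default']
```

Two caveats with this sketch: if the original queue_groups has no `*` entry, nothing listens to the default queue after translation, and where selectors overlap, routing rules send a job to the first match only, which differs from how overlapping selector groups behave.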
For customers yet to upgrade to 15.4:
1. 15.6 - Revert default routing rules on rails app - gitlab-org/gitlab!103483 (merged)
2. [ ] 15.6 - Add logic in Omnibus & charts that sets routing_rules to [['*', nil]] if routing_rules is not already set and queue_groups consists only of * entries (there can be multiple). We don't want to interfere with the default routing_rules when users have customized queue_groups.
   - [ ] gitlab-org/omnibus-gitlab!6511 (closed)
   - [ ] TODO: Charts MR
3. Backport the revert in (1) to 15.4 and 15.5 - https://gitlab.com/gitlab-org/release/tasks/-/issues/4522
4. [ ] 15.7 - Reintroduce default routing rules on rails app
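The condition described in step 2 could look roughly like the sketch below. This is not the code from the Omnibus MR: `effective_routing_rules` is a hypothetical name, and treating both nil and an empty array as "not already set" is an assumption.

```ruby
# Hypothetical sketch of the step 2 condition; not the actual Omnibus change.
def effective_routing_rules(routing_rules, queue_groups)
  # Assumption: nil or [] both mean the user has not set routing_rules.
  user_set = !(routing_rules.nil? || routing_rules.empty?)
  return routing_rules if user_set

  groups = Array(queue_groups)
  only_catch_all = !groups.empty? && groups.all? { |g| g.strip == '*' }

  # Only fill in a value when every process listens to all queues;
  # otherwise leave routing_rules untouched.
  only_catch_all ? [['*', nil]] : routing_rules
end

effective_routing_rules(nil, ['*', '*'])          # => [['*', nil]]
effective_routing_rules(nil, ['urgency=high', '*']) # => nil (left untouched)
```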