
Self-managed users experiencing degraded Sidekiq performance due to enabling queue selector and defaulted routing rules

Background

In %15.4, we enabled default routing rules in the Rails app, which results in all jobs being routed to the default queue. This reduces the number of queues from 400+ (one queue per worker) to 2 queues (default and mailers) for typical self-managed users.

We wrongly assumed that most users were not using the queue selector, and since Sidekiq's queue_groups default to * anyway (https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/cba33606/files/gitlab-cookbooks/gitlab/attributes/default.rb#L599), the change was deemed safe.

Incident

Since the %15.4 release, the Support team has reported an increase in Zendesk tickets about poor Sidekiq performance (at least 9 tickets).

List of tickets and summaries

  • https://gitlab.zendesk.com/agent/tickets/336502
    • Installation type: Docker
    • Sidekiq settings:
      sidekiq['queue_groups'] = [
        # Run all non-CPU-bound queues that are high urgency
        'resource_boundary!=cpu&urgency=high',
        # Run all continuous integration and pages queues that are not high urgency
        'feature_category=continuous_integration,pages&urgency!=high',
        # Run all queues
        '*'
      ]
  • https://gitlab.zendesk.com/agent/tickets/325926
    • Installation type: Omnibus
    • Sidekiq settings:
      sidekiq['queue_selector'] = true
      sidekiq['queue_groups'] = [
        "*",
        "*",
        "*",
        "*"
      ]
    • The ticket was logged on 14th Sept, before %15.4 was released, so it is unrelated. It was resolved by increasing queue_groups.
  • https://gitlab.zendesk.com/agent/tickets/340214
    • Installation type: Docker
    • Sidekiq settings:
      sidekiq['min_concurrency'] = 5
      sidekiq['max_concurrency'] = 15
      sidekiq['queue_selector'] = true
      sidekiq['queue_groups'] = [
        "urgency=high",
        "urgency=high",
        "urgency=low",
        "urgency=low",
        "urgency=low",
        "urgency=throttled",
        "urgency=throttled",
        "*"
      ]
  • https://gitlab.zendesk.com/agent/tickets/337432
    • Installation type: Docker
    • Sidekiq settings:
      sidekiq['enable'] = true
      sidekiq['queue_selector'] = true
      sidekiq['queue_groups'] = [  'urgency=high',  '*']
  • https://gitlab.zendesk.com/agent/tickets/341725
    • Installation type: Docker
    • Sidekiq settings:
      sidekiq['enable'] = true
      sidekiq['queue_selector'] = true
      sidekiq['queue_groups'] = [
        # Extra resources to support merge requests
        'feature_category=continuous_integration,source_code_management',
        'feature_category=continuous_integration,source_code_management',
        'feature_category=continuous_integration,source_code_management',
        'feature_category=continuous_integration,source_code_management',
        # Run all continuous integration and pages queues that are high urgency
        'feature_category=continuous_integration,pages&urgency=high',
        # Run all integrations queues
        'feature_category=integrations',
        # Run all non-CPU-bound queues that are high urgency
        'resource_boundary!=cpu&urgency=high',
        # Run all queues
        '*'
      ]
  • https://gitlab.zendesk.com/agent/tickets/341178
    • Installation type: Docker
    • Sidekiq settings: NA (SOS file yet to be provided)
  • https://gitlab.zendesk.com/agent/tickets/342174
    • Installation type: Omnibus
    • Sidekiq settings:
      ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 10 --timeout 25 *
      Equivalent to sidekiq['queue_groups'] = ['*'] - Shouldn't be affected by the routing rules change.
  • https://gitlab.zendesk.com/agent/tickets/337993
    • Installation type: Omnibus
    • Sidekiq settings:
      ruby /opt/gitlab/embedded/service/gitlab-rails/bin/sidekiq-cluster -e production -r /opt/gitlab/embedded/service/gitlab-rails -m 50 --timeout 25 *
      Equivalent to sidekiq['queue_groups'] = ['*'] - Shouldn't be affected by the routing rules change.
  • https://gitlab.zendesk.com/agent/tickets/334091

Note: 15.4 was released on 22nd September 2022.

Some Sidekiq settings were derived from the ticket or SOS file (if the customer didn't provide gitlab.rb).

Common patterns from the incidents

From 15.4, users who have not set routing_rules but use queue_selector with dedicated processes in queue_groups for certain attributes (e.g. 'resource_boundary!=cpu&urgency=high') experienced slowness in clearing Sidekiq jobs. This happens because the default routing_rules push all jobs to the default queue (as opposed to one queue per worker previously), so Sidekiq processes that are not listening to * sit idle. In other words, only the processes in queue_groups with * are doing any work.
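The pattern above can be illustrated with a small toy model. This is only a sketch and not GitLab code: the Worker struct, the two sample workers, and the helper names (matches?, routed_queue, listened_queues, idle_groups) are all hypothetical, and the selector parser handles only the attr=value, attr!=value, & and * forms seen in the tickets. It assumes a nil routing target means "use the per-worker queue".

```ruby
# Toy model of queue selector vs. routing rules (hypothetical, not GitLab code).
Worker = Struct.new(:name, :queue, :urgency, :resource_boundary, :feature_category,
                    keyword_init: true)

# Hypothetical sample workers, each with its own per-worker queue name.
WORKERS = [
  Worker.new(name: 'PipelineWorker', queue: 'pipeline', urgency: 'high',
             resource_boundary: 'memory', feature_category: 'continuous_integration'),
  Worker.new(name: 'ExportWorker', queue: 'export', urgency: 'low',
             resource_boundary: 'cpu', feature_category: 'importers')
].freeze

# Returns true when the worker satisfies every &-joined term of the selector.
def matches?(selector, worker)
  return true if selector == '*'

  selector.split('&').all? do |term|
    if term.include?('!=')
      attr, values = term.split('!=', 2)
      !values.split(',').include?(worker[attr.to_sym])
    else
      attr, values = term.split('=', 2)
      values.split(',').include?(worker[attr.to_sym])
    end
  end
end

# First matching rule wins; a nil target is assumed to mean the per-worker queue.
def routed_queue(worker, routing_rules)
  rule = routing_rules.find { |selector, _queue| matches?(selector, worker) }
  return worker.queue if rule.nil? || rule[1].nil?

  rule[1]
end

# Queues a process listens to for one queue_groups entry under queue_selector.
def listened_queues(selector)
  return :all if selector == '*'

  WORKERS.select { |w| matches?(selector, w) }.map(&:queue)
end

# queue_groups entries whose process never receives a job under the given rules.
def idle_groups(queue_groups, routing_rules)
  busy_queues = WORKERS.map { |w| routed_queue(w, routing_rules) }.uniq
  queue_groups.select do |selector|
    listened = listened_queues(selector)
    listened != :all && (listened & busy_queues).empty?
  end
end

queue_groups = ['resource_boundary!=cpu&urgency=high', '*']
p idle_groups(queue_groups, [['*', 'default']]) # every job goes to 'default'
p idle_groups(queue_groups, [['*', nil]])       # every job goes to its per-worker queue
```

In this model the dedicated process is idle when everything is routed to default, and becomes useful again once jobs are routed to per-worker queues.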

Resolution

For affected customers:

  • The quick fix is to advise customers to set sidekiq['routing_rules'] = [['*', nil]] while keeping their existing queue selectors.

  • We can also advise customers to start using custom routing_rules by translating their queue_groups selectors into routing_rules mappings. However, this could put a lot of work on the Support team and would be error-prone, since they would need to come up with queue names and list those names manually in queue_groups without using queue_selector (unless gitlab-org/omnibus-gitlab!6289 (closed) is already merged). We would also need to update our documentation to encourage the use of routing rules instead of queue selector and to explain the relationship between them.

    • Example:
      • Current gitlab.rb with queue_selector:
      # sidekiq['routing_rules'] = []
      sidekiq['enable'] = true
      sidekiq['queue_selector'] = true
      sidekiq['queue_groups'] = [
        # Run all non-CPU-bound queues that are high urgency
        'resource_boundary!=cpu&urgency=high',
        # Run all continuous integration and pages queues that are not high urgency
        'feature_category=continuous_integration,pages&urgency!=high',
        # Run all queues
        '*'
      ]
      • Updated gitlab.rb using routing_rules, with queue_selector disabled:
      sidekiq['routing_rules'] = [
        # All non-CPU-bound jobs that are high urgency go to 'urgent_other' queue
        ['resource_boundary!=cpu&urgency=high', 'urgent_other'],
        # All continuous integration and pages jobs that are not high urgency go to 'ci_pages' queue
        ['feature_category=continuous_integration,pages&urgency!=high', 'ci_pages'],
        # All other jobs go to 'default' queue
        ['*', 'default']
      ]
      sidekiq['enable'] = true
      sidekiq['queue_selector'] = false
      sidekiq['queue_groups'] = [
        'urgent_other',
        'ci_pages',
        'default'
      ]
      Note: Users can spin up more processes of a certain queue as before, e.g. sidekiq['queue_groups'] = ['urgent_other', 'urgent_other', 'ci_pages', 'ci_pages', 'default']
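For comparison, the quick fix from the first bullet above might look like the following in gitlab.rb. This is a sketch that reuses the queue_groups from the example; it assumes that a nil target in routing_rules sends each job to its per-worker queue, as before 15.4, and that the change is applied with the usual gitlab-ctl reconfigure.

```ruby
# gitlab.rb (sketch of the quick fix, assuming a nil target means "per-worker queue")
# Route every job back to its own per-worker queue so the selectors below match again.
sidekiq['routing_rules'] = [['*', nil]]
sidekiq['enable'] = true
# Existing queue selector configuration is kept unchanged.
sidekiq['queue_selector'] = true
sidekiq['queue_groups'] = [
  # Run all non-CPU-bound queues that are high urgency
  'resource_boundary!=cpu&urgency=high',
  # Run all continuous integration and pages queues that are not high urgency
  'feature_category=continuous_integration,pages&urgency!=high',
  # Run all queues
  '*'
]
```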

For customers yet to upgrade to 15.4:

  1. 15.6 - Revert default routing rules on rails app - gitlab-org/gitlab!103483 (merged)

  2. [ ] 15.6 - Add the logic in Omnibus & charts that sets routing_rules to [['*', nil]] if routing_rules is not already set and queue_groups consists only of *s (can be multiple). We don't want to interfere with the default routing_rules when users are touching the queue_groups.
     • [ ] gitlab-org/omnibus-gitlab!6511 (closed)
     • [ ] TODO: Charts MR

  3. Backport the revert in (1) to 15.4 and 15.5 - https://gitlab.com/gitlab-org/release/tasks/-/issues/4522

  4. [ ] 15.7 - Reintroduce default routing rules on rails app
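The condition described in step 2 can be sketched in plain Ruby. This is not the actual Omnibus or charts implementation: effective_routing_rules is a hypothetical helper, and treating both nil and an empty array as "not already set" is an assumption.

```ruby
# Hypothetical helper illustrating the step 2 condition; not the Omnibus implementation.
def effective_routing_rules(routing_rules, queue_groups)
  # Assumption: nil or an empty array both mean "routing_rules not already set".
  already_set = !(routing_rules.nil? || routing_rules.empty?)
  return routing_rules if already_set

  # Only apply [['*', nil]] when every queue_groups entry is '*' (possibly several).
  only_stars = !queue_groups.empty? && queue_groups.all? { |group| group == '*' }
  return [['*', nil]] if only_stars

  # The user customised queue_groups: leave routing_rules untouched.
  routing_rules
end

p effective_routing_rules(nil, ['*', '*'])
p effective_routing_rules(nil, ['urgency=high', '*'])
p effective_routing_rules([['*', 'default']], ['*'])
```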

Edited by Marco Gregorius