Ability to route Sidekiq jobs to a different queue
This is the first step for &447 (closed) and &194 (closed). We want to be able to define a worker-to-queue routing mapping in configuration (for source installs, Omnibus, and charts).
Solution
- The worker routing rules are defined as an array of tuples. Each tuple contains a selector and a corresponding queue name. The first match wins. If a worker doesn't match any selector, its queue name is translated from the worker name. If a worker matches a selector but the queue value is `null`, it also uses the translated queue name. Otherwise, the job is routed to the queue specified next to the selector. We use an array of tuples rather than an object because the JSON specification doesn't guarantee key ordering, and relying on hash key ordering could change the matching order.
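For illustration, a minimal Ruby sketch of this first-match-wins resolution could look like the following. The class and method names (`WorkerRouter`, `translate_queue_name`) and the name-translation scheme are assumptions for illustration, not the actual GitLab implementation; selector support is simplified to `attribute=value` terms joined by `&`, plus the `*` wildcard:

```ruby
# Minimal sketch of first-match-wins routing (illustrative names, not GitLab's code).
class WorkerRouter
  def initialize(rules)
    # rules: array of [selector, queue] tuples,
    # e.g. [["resource_boundary=memory", nil], ["*", nil]]
    @rules = rules
  end

  def route(worker_metadata, worker_name)
    @rules.each do |selector, queue|
      next unless matches?(selector, worker_metadata)

      # A null queue means "use the queue name translated from the worker name".
      return queue || translate_queue_name(worker_name)
    end

    # No selector matched: fall back to the translated worker-name queue.
    translate_queue_name(worker_name)
  end

  private

  def matches?(selector, metadata)
    return true if selector == '*'

    selector.split('&').all? do |term|
      attribute, value = term.split('=', 2)
      metadata[attribute.to_sym].to_s == value
    end
  end

  # Illustrative translation only; the real scheme may differ.
  def translate_queue_name(worker_name)
    worker_name.sub(/Worker\z/, '').gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  end
end

router = WorkerRouter.new([["resource_boundary=memory", nil], ["*", nil]])
router.route({ resource_boundary: 'memory' }, 'AdminEmailWorker') # => "admin_email"
```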
- We follow a rolling-update strategy: we try it out with the `catchall` and `mailers` queues first, since we don't want to mess up the other shards.
- On GitLab.com, replicate the sharding configuration in production. All the queue names are set to `null`. As a result, after the routing logic is deployed and/or the configuration is set, all of the jobs are routed to worker-name queues, just like before:
```
[
  ["resource_boundary=memory", null],
  ["feature_category=database&urgency=throttled", null],
  ["feature_category=gitaly&urgency=throttled", null],
  ["feature_category=global_search&urgency=throttled", null],
  ["resource_boundary=cpu&urgency=default,low", null],
  ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null],
  ["resource_boundary!=cpu&urgency=high", null],
  ["*", null]
]
```
- On self-managed instances, it's likely that the routing is not configured. Hence, the jobs are routed to worker-name queues, same as above.
- (Optional) We could add the matching selector to the Sidekiq structured logs. After the change is deployed, even though the jobs are still routed the same as before, we can verify that each worker matches the selector we expect. This boosts our confidence.
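As a rough illustration only (not the existing logging code), the matched selector could be attached to the job payload from a Sidekiq client middleware so that it appears in the structured logs. The `routing_rule` field name and the `matched_rule_for` helper below are hypothetical:

```ruby
# Hypothetical client middleware: records which routing rule matched the job.
class RoutingRuleLoggingMiddleware
  def call(worker_class, job, queue, redis_pool)
    # `matched_rule_for` is a hypothetical helper on the router sketched above.
    job['routing_rule'] = WorkerRouter.matched_rule_for(worker_class)
    yield
  end
end
```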
- Support the worker name in queue selector attributes. For example: `worker_name=PagesWorker|worker_name=AdminEmailWorker`
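Extending the earlier sketch, `|` could be treated as an OR across `&`-joined terms (again, only an illustration of the intended semantics):

```ruby
class WorkerRouter
  # Replaces the matcher from the earlier sketch: `|` is OR across `&`-joined terms.
  def matches?(selector, metadata)
    return true if selector == '*'

    selector.split('|').any? do |conjunction|
      conjunction.split('&').all? do |term|
        attribute, value = term.split('=', 2)
        metadata[attribute.to_sym].to_s == value
      end
    end
  end
end
```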
- Update the configuration to roll out queues one by one. At first, we can be very defensive and test with some minor, safe, and tolerant workers first:
```
[
  ["worker_name=PagesWorker|worker_name=AdminEmailWorker", "default"],
  ["resource_boundary=memory", null],
  ["feature_category=database&urgency=throttled", null],
  //...,
  ["*", null]
]
```
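With the sketches above, this first rule would send only the two named workers to `default`, while everything else keeps its translated worker-name queue (the worker metadata below is illustrative):

```ruby
router = WorkerRouter.new([
  ['worker_name=PagesWorker|worker_name=AdminEmailWorker', 'default'],
  ['*', nil]
])

router.route({ worker_name: 'PagesWorker' }, 'PagesWorker')         # => "default"
router.route({ worker_name: 'SomeOtherWorker' }, 'SomeOtherWorker') # => "some_other"
```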
- Next, continue to roll out all jobs inside `default`:
  - `catchall` is listening to `default`, hence it continues to process the workers in the `default` queue.
  - The `catchall` configuration is still listening to the worker-name queues, hence, if a worker's queue name doesn't fall into `default`, it's still processed by `catchall`.
  - If a worker should not be in the `default` queue, but is tagged as `default`, it is handled by `catchall`. We can detect such outliers easily.
```
[
  //...,
  ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null],
  ["resource_boundary!=cpu&urgency=high", null],
  ["*", "default"]
]
```
- Keep only `default` and `mailers` in the `catchall` shard. #989 (closed)
- Roll out to the other shards, one by one. As the new queue name (`memory-bound`, for example) is newly introduced, only the targeted shard consumes the jobs inside that queue. We expect the workload of each shard to stay the same, and the jobs to continue to be pushed to the right shard.
  - Add the queue name to the queue selector of the Sidekiq shard: `resource_boundary=memory|name=memory-bound`
  - Update the queue name item in the worker routing map:
```
[
  ["resource_boundary=memory", "memory-bound"],
  //...
]
```
- Remove the redundant selector rules from the shard's queue selector, keeping only the queue-name rule (for example, `name=memory-bound`).
- Update the fallback queue mentioned in the first step to `default`.
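For illustration only, the end state of the routing map could then look roughly like this (the actual queue names depend on the shard rollout):

```
[
  ["resource_boundary=memory", "memory-bound"],
  //...,
  ["*", "default"]
]
```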