Ability to route Sidekiq jobs to a different queue
This is the first step for &447 (closed) and &194 (closed). We want to be able to define a worker-to-queue routing mapping in configuration (for source installs, Omnibus, and charts).
Solution
- The worker routing rules are defined as an array of tuples. Each tuple contains a selector and a corresponding queue name. The first match wins. If a worker doesn't match any selector, its queue name is translated from the worker name. If a worker matches a selector but the queue value is `null`, it also uses the translated queue name. Otherwise, the job is routed to the queue specified next to the selector. We use an array of tuples rather than an object because the JSON specification doesn't guarantee key ordering, and relying on hash key ordering could change the matching order.
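For illustration, a minimal Ruby sketch of this first-match-wins resolution could look like the following. The class and method names (`WorkerRouter`, `translate_queue_name`) and the name-translation scheme are assumptions for illustration, not the actual GitLab implementation; selector support is simplified to `attribute=value` terms joined by `&`, plus the `*` wildcard:

```ruby
# Minimal sketch of first-match-wins routing (illustrative names, not GitLab's code).
class WorkerRouter
  def initialize(rules)
    # rules: array of [selector, queue] tuples,
    # e.g. [["resource_boundary=memory", nil], ["*", nil]]
    @rules = rules
  end

  def route(worker_metadata, worker_name)
    @rules.each do |selector, queue|
      next unless matches?(selector, worker_metadata)

      # A null queue means "use the queue name translated from the worker name".
      return queue || translate_queue_name(worker_name)
    end

    # No selector matched: fall back to the translated worker-name queue.
    translate_queue_name(worker_name)
  end

  private

  def matches?(selector, metadata)
    return true if selector == '*'

    selector.split('&').all? do |term|
      attribute, value = term.split('=', 2)
      metadata[attribute.to_sym].to_s == value
    end
  end

  # Illustrative translation only; the real scheme may differ.
  def translate_queue_name(worker_name)
    worker_name.sub(/Worker\z/, '').gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  end
end

router = WorkerRouter.new([["resource_boundary=memory", nil], ["*", nil]])
router.route({ resource_boundary: 'memory' }, 'AdminEmailWorker') # => "admin_email"
```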
- We follow a rolling-update strategy: we try it out with the `catchall` and `mailers` queues first, since we don't want to mess up the other shards.
- On GitLab.com, replicate the sharding configuration in production. All the queue names are set to `null`. As a result, after the routing logic is deployed and/or the configuration is set, all of the jobs are routed to worker-name queues, just like before:
```
[
  ["resource_boundary=memory", null],
  ["feature_category=database&urgency=throttled", null],
  ["feature_category=gitaly&urgency=throttled", null],
  ["feature_category=global_search&urgency=throttled", null],
  ["resource_boundary=cpu&urgency=default,low", null],
  ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null],
  ["resource_boundary!=cpu&urgency=high", null],
  ["*", null]
]
```
- On self-managed instances, it's likely that the routing is not configured. Hence, the jobs are routed to worker-name queues, same as above.
- (Optional) We could add the matching selector to the Sidekiq structured logs. After the change is deployed, even though the jobs are still routed the same as before, we can verify that each worker matches the selector we expect. This boosts our confidence.
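As a rough illustration only (not the existing logging code), the matched selector could be attached to the job payload from a Sidekiq client middleware so that it appears in the structured logs. The `routing_rule` field name and the `matched_rule_for` helper below are hypothetical:

```ruby
# Hypothetical client middleware: records which routing rule matched the job.
class RoutingRuleLoggingMiddleware
  def call(worker_class, job, queue, redis_pool)
    # `matched_rule_for` is a hypothetical helper on the router sketched above.
    job['routing_rule'] = WorkerRouter.matched_rule_for(worker_class)
    yield
  end
end
```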
- Support the worker name in queue selector attributes. For example: `worker_name=PagesWorker|worker_name=AdminEmailWorker`
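Extending the earlier sketch, `|` could be treated as an OR across `&`-joined terms (again, only an illustration of the intended semantics):

```ruby
class WorkerRouter
  # Replaces the matcher from the earlier sketch: `|` is OR across `&`-joined terms.
  def matches?(selector, metadata)
    return true if selector == '*'

    selector.split('|').any? do |conjunction|
      conjunction.split('&').all? do |term|
        attribute, value = term.split('=', 2)
        metadata[attribute.to_sym].to_s == value
      end
    end
  end
end
```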
- Update the configuration to roll out queues one by one. At first, we can be very defensive and test with some minor, safe, and tolerant workers first:
```
[
  ["worker_name=PagesWorker|worker_name=AdminEmailWorker", "default"],
  ["resource_boundary=memory", null],
  ["feature_category=database&urgency=throttled", null],
  //...,
  ["*", null]
]
```
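With the sketches above, this first rule would send only the two named workers to `default`, while everything else keeps its translated worker-name queue (the worker metadata below is illustrative):

```ruby
router = WorkerRouter.new([
  ['worker_name=PagesWorker|worker_name=AdminEmailWorker', 'default'],
  ['*', nil]
])

router.route({ worker_name: 'PagesWorker' }, 'PagesWorker')         # => "default"
router.route({ worker_name: 'SomeOtherWorker' }, 'SomeOtherWorker') # => "some_other"
```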
- Next, continue to roll out all jobs inside `default`:
  - `catchall` is listening to `default`, hence it continues to process the workers in the `default` queue.
  - The `catchall` configuration is still listening to the worker-name queues, hence, if a worker's queue name doesn't fall into `default`, it's still processed by `catchall`.
  - If a worker should not be in the `default` queue, but is tagged as `default`, it is handled by `catchall`. We can detect such outliers easily.
```
[
  //...,
  ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null],
  ["resource_boundary!=cpu&urgency=high", null],
  ["*", "default"]
]
```
- Keep only `default` and `mailers` in the `catchall` shard. #989 (closed)
- Roll out to the other shards, one by one. As the new queue name (`memory-bound`, for example) is newly introduced, only the targeted shard consumes the jobs inside that queue. We expect the workload of each shard to stay the same, and the jobs to continue to be pushed to the right shard.
  - Add the queue name to the queue selector of the Sidekiq shard: `resource_boundary=memory|name=memory-bound`
  - Update the queue name item in the worker routing map:
```
[
  ["resource_boundary=memory", "memory-bound"],
  //...
]
```
- Remove the redundant selector rules from the shard's queue selector, keeping only the queue-name rule (for example, `name=memory-bound`).
- Update the fallback queue mentioned in the first step to `default`.
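For illustration only, the end state of the routing map could then look roughly like this (the actual queue names depend on the shard rollout):

```
[
  ["resource_boundary=memory", "memory-bound"],
  //...,
  ["*", "default"]
]
```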