Allow sidekiq cluster to select queues to execute based on a query of worker attributes
Currently, sidekiq cluster has two options for configuring the queues that the sidekiq processes will listen to:
- A list of queue groups. Each queue group specification contains a mixture of queue prefixes and queue names. sidekiq-cluster expands queue prefixes into all queues with that prefix.
- A `--negate` option, which selects the inverse of the specified list of queues. This is used to specify the best-effort catch-all priority, using `--negate` and a list of all queues specified in other queue groups.
This approach works well up to a certain scale but has some drawbacks:
- Operators of the cluster, not the application engineers, need to configure the cluster to handle different workloads
- Operators do not have a good understanding of the workload, unlike application engineers, so workloads tend to fall back to the default catch-all
Over June, July, and August 2019, the complexity of the sidekiq-cluster configuration contributed to several outages.
In &96 (closed), some of this complexity is being addressed.
Proposal: Allow sidekiq cluster to select queues to execute based on a query of worker attributes
Instead of the current approach of having operators configure queue lists directly, provide a way for operators to select queues based on worker attributes.
Instead of only allowing queue names and queue prefixes as we do at present, allow queue attributes to be selected.
The syntax is loosely based on PromQL selectors:
Operators (in descending priority):

- `|`: concatenates values into a set
- `=`: equality operator
- `!=`: inequality operator
- `,`: logical AND operator
Working with the syntax
A quick guide to using this query syntax follows:
A filter term has the form `<field><operator><set>`. The field, on the left-hand side, refers to an attribute associated with a worker; the attributes used in the examples below are `feature_category`, `latency_sensitive`, and `resource_boundary`. The right-hand side of every term is a set containing one or more values. Values can be combined into the set using the `|` operator.
The operator can be either:

- `=`: the worker's attribute on the left-hand side MUST have a value contained within the set on the right-hand side
- `!=`: the worker's attribute on the left-hand side MUST NOT have a value contained within the set on the right-hand side
Some example terms:
- `resource_boundary=cpu` will match all workers with a resource boundary of `cpu`.
- `resource_boundary=cpu|memory` will match all workers with a resource boundary of either `cpu` or `memory`.
Multiple terms can be added together using `,` as a conjunction: when `,` is used, all terms must match in order for a worker to be selected. In other words, `,` works as an AND operator.
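As a rough illustration of these semantics (not the actual sidekiq-cluster implementation), a single clause can be split on `,` into terms, each term split into field, operator, and value set, and then matched against a worker's attributes, represented here as a plain Hash. The `parse_clause` and `matches?` helper names are hypothetical:

```ruby
# Sketch only: parse a clause such as 'resource_boundary=cpu,latency_sensitive=true'
# into terms, then check them all (comma = AND) against a worker's attributes.

def parse_clause(clause)
  clause.split(',').map do |term|
    field, op, set = term.match(/\A(\w+)(!=|=)(.*)\z/).captures
    { field: field.to_sym, negate: op == '!=', values: set.split('|') }
  end
end

def matches?(worker_attributes, clause)
  parse_clause(clause).all? do |term|
    in_set = term[:values].include?(worker_attributes[term[:field]])
    term[:negate] ? !in_set : in_set  # != inverts set membership
  end
end

worker = { resource_boundary: 'cpu', latency_sensitive: 'true', feature_category: 'pages' }
matches?(worker, 'resource_boundary=cpu,latency_sensitive=true')  # => true
matches?(worker, 'resource_boundary=cpu,feature_category!=pages') # => false
```

Note that the `|` set operator needs no special precedence handling here: because `,` binds last, each term's right-hand side can simply be split on `|`.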
Multiple clauses can be concatenated together using a space, selecting workers that match any of the clauses.
Examples using the existing queue groups
Here are examples of how the existing queue groups can be defined using this syntax:
ASAP priorities are CPU-bound and latency-sensitive, but (for now) exclude the importer, continuous integration, and pages queues:
sidekiq-cluster --select-queues 'resource_boundary=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration|pages'
Realtime priorities are latency-sensitive but not CPU-bound, and (for now) exclude the importer, continuous integration, and pages queues:
sidekiq-cluster --select-queues 'resource_boundary!=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration|pages'
Run all worker queues associated with continuous integration:

sidekiq-cluster --select-queues 'feature_category=continuous_integration'

Run all worker queues associated with pages:

sidekiq-cluster --select-queues 'feature_category=pages'

Run all worker queues associated with importers:

sidekiq-cluster --select-queues 'feature_category=importers'

Run all memory-bound worker queues:

sidekiq-cluster --select-queues 'resource_boundary=memory'
For the best-effort catch-all, concatenate the above categories together with a space, then use `--negate`:
sidekiq-cluster --select-queues 'resource_boundary=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration resource_boundary=unknown,latency_sensitive=true,feature_category!=importers|continuous_integration feature_category=continuous_integration feature_category=pages feature_category=importers resource_boundary=memory' --negate
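To make the combined semantics concrete, here is a small sketch of how space-separated clauses (OR) and `--negate` (complement) could resolve to a list of queues, given a table of worker attributes. The `select_queues` helper, the inline matcher, and the sample worker data are all illustrative, not the real sidekiq-cluster code:

```ruby
# Sketch only: a queue is selected if its worker's attributes match ANY
# space-separated clause; --negate returns the complement of that selection.

def term_matches?(attrs, term)
  field, op, set = term.match(/\A(\w+)(!=|=)(.*)\z/).captures
  in_set = set.split('|').include?(attrs[field.to_sym])
  op == '!=' ? !in_set : in_set
end

def select_queues(workers, query, negate: false)
  selected = workers.select do |_queue, attrs|
    query.split(' ').any? do |clause|                       # space = OR
      clause.split(',').all? { |term| term_matches?(attrs, term) }  # comma = AND
    end
  end.keys
  negate ? workers.keys - selected : selected
end

# Hypothetical worker attribute table:
workers = {
  'post_receive'   => { resource_boundary: 'cpu',     feature_category: 'source_code_management' },
  'pages_domain'   => { resource_boundary: 'unknown', feature_category: 'pages' },
  'project_export' => { resource_boundary: 'memory',  feature_category: 'importers' }
}

select_queues(workers, 'feature_category=pages resource_boundary=memory')
# => ["pages_domain", "project_export"]
select_queues(workers, 'feature_category=pages resource_boundary=memory', negate: true)
# => ["post_receive"]
```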
As described in &96 (closed), the intention would be to simplify the queue groups over time, but this shows that the current queue group specifications should be configurable with this syntax.
Improvements over current approach
- We don't need to extend `sidekiq-queues.yml` with additional information
- Chef doesn't become dependent on the Rails application, and doesn't need to rely on `sidekiq-queues.yml`. Fewer dependencies = less fragility
- The queries mean that Sidekiq will continue to run the correct jobs across deploys, even when workers change attributes or new workers are added. As soon as a worker is deployed, its jobs will be processed in the correct queue.