Allow sidekiq cluster to select queues to execute based on a query of worker attributes
Currently, sidekiq cluster has two options for configuring the queues that the sidekiq processes will listen to:
- A list of queue groups. Each queue group specification contains a mixture of queue prefixes and queue names. sidekiq-cluster expands queue prefixes into all queues with that prefix.
- A `--negate` option, which selects the inverse of the specified list of queues. This is used to specify the best-effort catch-all priority, using `--negate` and a list of all queues specified in other queue groups.
This approach works well up to a certain scale but has some drawbacks:
- Operators of the cluster, not the application engineers, need to configure the cluster to handle different workloads
- Operators do not have a good understanding of the workload, unlike application engineers, so workloads tend to fall back to the default catch-all
Over June, July, and August 2019, the complexity of the sidekiq-cluster configuration contributed to several outages.
In &96 (closed), some of this complexity is being addressed.
Proposal: Allow sidekiq cluster to select queues to execute based on a query of worker attributes
Instead of the current approach of having operators configure queue lists directly, provide a way for operators to select queues based on worker attributes.
Instead of only allowing queue names and queue prefixes as we do at present, allow queue attributes to be selected.
The syntax is loosely based on PromQL selectors:
Operators (in descending priority):

- `|`: concatenates values into a set
- `=`: equality operator
- `!=`: inequality operator
- `,`: logical AND operator
Working with the syntax
A quick guide to using this query syntax follows:
A filter term has the form `<field><operator><set>`. The field, on the left-hand side, refers to an attribute associated with a worker; the attributes used in the examples below are `feature_category`, `latency_sensitive`, and `resource_boundary`. The right-hand side of every term is a set containing one or more values. Values can be combined into the set using the `|` operator.
The operator can be either:

- `=`: the worker's attribute on the left-hand side MUST have a value contained within the set on the right-hand side
- `!=`: the worker's attribute on the left-hand side MUST NOT have a value contained within the set on the right-hand side
Some example terms:
- `resource_boundary=cpu` will match all workers with a resource boundary of `cpu`.
- `resource_boundary=cpu|memory` will match all workers with a resource boundary of either `cpu` or `memory`.
Multiple terms can be added together using `,` as a conjunction: when `,` is used, all terms must match in order for a worker to be selected. In other words, `,` works as an AND operator.
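As a rough illustration of these semantics (not the actual sidekiq-cluster implementation), a single clause can be split on `,` into terms, each term split into field, operator, and value set, and then matched against a worker's attributes, represented here as a plain Hash. The `parse_clause` and `matches?` helper names are hypothetical:

```ruby
# Sketch only: parse a clause such as 'resource_boundary=cpu,latency_sensitive=true'
# into terms, then check them all (comma = AND) against a worker's attributes.

def parse_clause(clause)
  clause.split(',').map do |term|
    field, op, set = term.match(/\A(\w+)(!=|=)(.*)\z/).captures
    { field: field.to_sym, negate: op == '!=', values: set.split('|') }
  end
end

def matches?(worker_attributes, clause)
  parse_clause(clause).all? do |term|
    in_set = term[:values].include?(worker_attributes[term[:field]])
    term[:negate] ? !in_set : in_set  # != inverts set membership
  end
end

worker = { resource_boundary: 'cpu', latency_sensitive: 'true', feature_category: 'pages' }
matches?(worker, 'resource_boundary=cpu,latency_sensitive=true')  # => true
matches?(worker, 'resource_boundary=cpu,feature_category!=pages') # => false
```

Note that the `|` set operator needs no special precedence handling here: because `,` binds last, each term's right-hand side can simply be split on `|`.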
Multiple clauses can be concatenated together using a space, selecting workers that match any of the clauses.
Examples using the existing queue groups
Here are examples of how the existing queue groups can be defined using this syntax:
ASAP priorities are CPU-bound and latency-sensitive, but (for now) exclude the importer, continuous integration, and pages queues:
sidekiq-cluster --select-queues 'resource_boundary=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration|pages'
Realtime priorities are latency-sensitive but not CPU-bound, and (for now) exclude the importer, continuous integration, and pages queues:
sidekiq-cluster --select-queues 'resource_boundary!=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration|pages'
Run all worker queues associated with continuous integration:

sidekiq-cluster --select-queues 'feature_category=continuous_integration'

Run all worker queues associated with pages:

sidekiq-cluster --select-queues 'feature_category=pages'

Run all worker queues associated with importers:

sidekiq-cluster --select-queues 'feature_category=importers'

Run all memory-bound worker queues:

sidekiq-cluster --select-queues 'resource_boundary=memory'
For the best-effort catch-all, concatenate the above categories together with a space, then use `--negate`:
sidekiq-cluster --select-queues 'resource_boundary=cpu,latency_sensitive=true,feature_category!=importers|continuous_integration resource_boundary=unknown,latency_sensitive=true,feature_category!=importers|continuous_integration feature_category=continuous_integration feature_category=pages feature_category=importers resource_boundary=memory' --negate
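To make the combined semantics concrete, here is a small sketch of how space-separated clauses (OR) and `--negate` (complement) could resolve to a list of queues, given a table of worker attributes. The `select_queues` helper, the inline matcher, and the sample worker data are all illustrative, not the real sidekiq-cluster code:

```ruby
# Sketch only: a queue is selected if its worker's attributes match ANY
# space-separated clause; --negate returns the complement of that selection.

def term_matches?(attrs, term)
  field, op, set = term.match(/\A(\w+)(!=|=)(.*)\z/).captures
  in_set = set.split('|').include?(attrs[field.to_sym])
  op == '!=' ? !in_set : in_set
end

def select_queues(workers, query, negate: false)
  selected = workers.select do |_queue, attrs|
    query.split(' ').any? do |clause|                       # space = OR
      clause.split(',').all? { |term| term_matches?(attrs, term) }  # comma = AND
    end
  end.keys
  negate ? workers.keys - selected : selected
end

# Hypothetical worker attribute table:
workers = {
  'post_receive'   => { resource_boundary: 'cpu',     feature_category: 'source_code_management' },
  'pages_domain'   => { resource_boundary: 'unknown', feature_category: 'pages' },
  'project_export' => { resource_boundary: 'memory',  feature_category: 'importers' }
}

select_queues(workers, 'feature_category=pages resource_boundary=memory')
# => ["pages_domain", "project_export"]
select_queues(workers, 'feature_category=pages resource_boundary=memory', negate: true)
# => ["post_receive"]
```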
As described in &96 (closed), the intention would be to simplify the queue groups over time, but this shows that the current queue group specifications should be configurable with this syntax.
Improvements over current approach
- We don't need to extend `sidekiq-queues.yml` with additional information
- Chef doesn't become dependent on the Rails application, and doesn't need to rely on `sidekiq-queues.yml`. Fewer dependencies = less fragility
- The queries mean that Sidekiq will continue to run the correct jobs across deploys, even when workers change attributes or new workers are added. As soon as a worker is deployed, its jobs will be processed in the correct queue.