Make the catchall shard use two queues: default and mailers
## Proposal

We create a config option that allows the application to choose where to send a job. By default, with no configuration, it will send the job to the queue name defined in the worker. With configuration, we can use a [queue selector](https://docs.gitlab.com/ee/administration/operations/extra_sidekiq_processes.html#queue-selector) to define an alternative queue for the job instead. For instance, in future we could say that jobs matching `feature_category=search` will go to the `search` queue.

Starting with the `catchall` shard has two big advantages:

1. It listens to so many queues that if we can reduce it to two queues - `default` and `mailers` - then we'll see roughly an order of magnitude reduction in the total number of queues we listen to. (From around 350 to around 50.)
1. The `catchall` shard already listens to the `default` queue, even though the queue never performs any work.

## Out of scope

DRI: @smcgivern

We actually have two `catchall` shards: one on VMs and one on Kubernetes. The scope of this epic is specific to the Kubernetes shard:

1. It currently processes more queues than the VM equivalent: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1d&g0.max_source_resolution=0s&g0.expr=sum%20by%20(cluster)%20(count%20by%20(cluster%2C%20queue)%20(queue%3Asidekiq_jobs_completion%3Arate1m%7Benv%3D%22gprd%22%2C%20shard%3D%22catchall%22%7D))&g0.tab=0
1. It also gets all new workers that run: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/998#note_560559021
1. We will migrate workers from VMs to Kubernetes over time, not the other way around.

## Exit criteria

1. [x] Ability to route jobs to a different queue: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/987
1. [x] Observability operates on workers as well as queues: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/988
1. [x] The catchall Kubernetes shard listens to only `default` and `mailers` in production: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/989

Optional follow-up items:

1. [ ] Clean up the catch* Sidekiq fleet: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1152
1. [ ] Rake task to find queues not in mapping: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1045
1. [x] Remove alert for unused queues in production: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3531
1. [ ] Allow moving selected enqueued jobs to another queue: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1080
1. [ ] Guidelines for self-managed instances: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1023
1. [ ] Workers that depend on checking their own queue size: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1087

## Status 2021-08-13

This is done now! :tada: We are listening to four queues, not two as planned (due to two we found for https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1087). We are seeing new daily CPU utilization peaks of 74%.
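The routing described in the proposal can be sketched as an ordered list of `[selector, queue]` rules, evaluated top to bottom, with a wildcard fallback. This is a minimal illustration, not GitLab's actual implementation: the `QueueRouter` class and the simplified `attribute=value` selector syntax are assumptions for the sake of the example (the real queue selector supports richer expressions).

```ruby
# Illustrative sketch of rule-based queue routing.
# Rules are ordered [selector, queue] pairs; the first matching
# selector wins. "*" matches every worker.
class QueueRouter
  def initialize(rules)
    # e.g. [["feature_category=search", "search"], ["*", "default"]]
    @rules = rules
  end

  # worker_attributes: a hash such as { "feature_category" => "search" }
  # Returns the queue name for the first matching rule, or nil.
  def route(worker_attributes)
    @rules.each do |selector, queue|
      return queue if match?(selector, worker_attributes)
    end
    nil
  end

  private

  # Supports only "*" and "attribute=value" selectors in this sketch.
  def match?(selector, attributes)
    return true if selector == "*"

    key, value = selector.split("=", 2)
    attributes[key] == value
  end
end

router = QueueRouter.new([
  ["feature_category=search", "search"],
  ["*", "default"]
])

router.route("feature_category" => "search") # => "search"
router.route("feature_category" => "code_review") # => "default"
```

With no rules configured, the application would fall back to the queue name defined in the worker, matching the proposal's default behaviour.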