Use routing rules by default and deprecate queue selectors for self-managed
DRI: @marcogreg

## Background

Over the last few years we've worked on a couple of iterations for better managing our Sidekiq workloads:

- Route background jobs by their characteristics (https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/174) introduced a **queue selector** for Sidekiq, so operators of a GitLab instance can specify that a Sidekiq process listens to, for instance, only high-urgency queues. This assumes that each Sidekiq worker class has its own named queue.
- Single queue per sidekiq shard (https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/469) introduced **routing rules**, which work like queue selectors but transform the queues used by Sidekiq worker classes. Using this model, you can route all high-urgency jobs to a single queue. This has large benefits for Redis CPU consumption.

Having both of these is unnecessary and a form of tech debt: we aren't using queue selectors on GitLab.com, but we need to keep supporting them in some form. As we're using routing rules, it would be good to make them the default for the simplest cases everywhere.

Migrating self-managed instances is challenging, though, because we do not control anything beyond the defaults for self-managed. Both queue selectors and routing rules introduce possibilities for configuration failure, although these are unlikely for most self-managed instances. The failure mode is the same: the configured queue list is not exhaustive, so the application simply does not process some queues. Routing rules make this better (fewer queues) but also worse (operators can pick arbitrary queue names).

From https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1991, we know that a fair number of customers are using queue selectors. So we still need to support queue selectors, while at the same time pushing to add default routing rules (routing all jobs to the `default` or `mailers` queue).

While supporting both queue selectors and routing rules, we can lay down the goals as:

1. Everyone automatically has the same number of Sidekiq processes as before.
1. Those Sidekiq processes have the same concurrency as before.
1. All jobs go to `default` or `mailers`.
1. Upgrades work without manual intervention.

## Plan

To accomplish the above goals, the plan would be:

1. Mark queue selector as deprecated in 15.9.
1. [Reintroduce default routing rules if routing rules are not set](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/97908) in 16.0 (previously reverted).
2. Change the `sidekiq-cluster` binary to include `default,mailers` in each process, if routing rules are the default, in 16.0. This way, even when jobs only go to `default` and `mailers` and queue selector is the running mode, all Sidekiq processes still process jobs with the same concurrency.
3. By now, we will have covered all of the scopes in this Epic. For completeness' sake, we should also plan the removal of queue selector, such as:
    1. Mechanism (migrate jobs, ensure concurrency is not too low, etc.)
    2. Which version to remove?

Note that self-managed instances won't benefit from reduced Redis CPU usage under this plan. Instead, the plan allows us to safely introduce routing rules, and then paves the way to clean up the tech debt by deprecating and eventually removing queue selector.

## Summary

As a part of previous epics, the new routing rule system is only used on GitLab.com. The default configuration we deliver to customers still uses queue selector. In addition, some workers depend on their own queue sizes. Those workers block the migration.

Therefore, the exit criteria would be:

- [x] We have no remaining workers with `needs_own_queue`
- [x] Routing rules are in use by default for the GDK
- [x] The queue selector is marked as deprecated
- [x] Routing rules are in use by default for self-managed instances
- [x] Each process in Sidekiq cluster also listens to the `default` and `mailers` queues (if using default routing rules)
- [x] ~~Write some small scripts to support the migration~~ (not needed anymore)
- [x] We have created a follow-up project / issue to fully remove the queue selector https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2220

## Status 2023-05-02

We have wrapped up the project with the [last MR merged](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111675) to [reintroduce routing rules by default](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1491) for self-managed instances in 16.0.

Queue selectors were deprecated in 15.9, with removal in 17.0. Removal progress will be tracked in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2220.
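
For illustration only, here is a toy Python model of the failure mode described in the Background (a queue list that is not exhaustive) and of the `sidekiq-cluster` change from the Plan (adding `default,mailers` to each process). It is a sketch under stated assumptions, not GitLab's implementation: the function names, the legacy queue lists, and the rule "mail jobs go to `mailers`, everything else to `default`" are all hypothetical simplifications, and concurrency is not modeled.

```python
"""Toy model of default routing plus the extra queues added to each process.

Everything here is a hypothetical illustration, not GitLab's implementation.
"""

DEFAULT_QUEUES = ("default", "mailers")


def route_job(is_mailer):
    """Assumed default routing: mail jobs go to `mailers`, the rest to `default`."""
    return "mailers" if is_mailer else "default"


def with_default_queues(process_queues):
    """Return a copy where every process also listens to `default` and `mailers`.

    The number of processes is unchanged; only their queue lists grow.
    """
    patched = []
    for queues in process_queues:
        extra = [q for q in DEFAULT_QUEUES if q not in queues]
        patched.append(list(queues) + extra)
    return patched


def unprocessed_queues(destinations, process_queues):
    """Queues that receive jobs but that no process listens to."""
    listened = {q for queues in process_queues for q in queues}
    return set(destinations) - listened


# Hypothetical legacy setup: two processes started with queue selectors.
legacy_processes = [["post_receive", "merge"], ["pipeline_processing"]]

# With default routing, jobs only ever land in `default` or `mailers`.
destinations = {route_job(is_mailer) for is_mailer in (False, True)}

# Without the change, nothing listens to the queues that now receive jobs.
stranded = unprocessed_queues(destinations, legacy_processes)

# With the change, every existing process picks up `default` and `mailers`.
patched_processes = with_default_queues(legacy_processes)
still_stranded = unprocessed_queues(destinations, patched_processes)
```

In this model, `stranded` contains both `default` and `mailers` for the legacy setup, while `still_stranded` is empty once each process's queue list is extended. The process count stays the same, which mirrors the goal that upgrades keep the same number of Sidekiq processes.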