Handle jobs for unknown (partially deployed) workers
In the past, we've relied on each worker having its own queue for jobs. This meant that when we deployed to canary and canary started scheduling jobs for newly introduced worker classes, those jobs would go to queues not yet watched by any Sidekiq pods. They would sit there until the deploy was promoted to production, at which point they would start being processed.
If we start scheduling jobs for new workers on queues that are already watched by Sidekiq, these jobs will be picked up immediately and fail, because the worker class cannot be instantiated on pods that don't have the new code yet.
By default, Sidekiq would rely on its retry mechanism with exponential back-off: 25 retries spread over up to 20 days, more than enough for the jobs to survive the deploy. But since we have a very low default retry threshold (#986), the jobs wouldn't survive more than a couple of minutes.
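As a sanity check on the 20-day figure, here is a quick calculation assuming Sidekiq's documented default back-off of roughly `(retry_count ** 4) + 15` seconds per retry (the random jitter Sidekiq adds on top is ignored):

```ruby
# Approximate total time covered by Sidekiq's default 25 retries,
# using the back-off delay of (retry_count ** 4) + 15 seconds and
# ignoring the random jitter added on top of it.
total_seconds = (0...25).sum { |retry_count| (retry_count ** 4) + 15 }

puts total_seconds                       # => 1763395
puts (total_seconds / 86_400.0).round(1) # => 20.4 (days)
```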
To address this, we can:
- Update all existing workers that don't set an explicit number of retries to use 3 retries explicitly.
- Make 25 retries the default for new workers (in ApplicationWorker). A sketch of both changes follows after this list.
- Maybe in a separate issue: add a RuboCop check to forbid `sidekiq_options retry: 3`, excluding all existing workers by path, so that the explicit setting doesn't get copied and pasted into new workers. A sketch of such a cop also follows below.
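Concretely, the first two points could look something like this. It is a minimal sketch: it assumes ApplicationWorker is an ActiveSupport::Concern, and `SomeExistingWorker` is a hypothetical stand-in for the workers we'd update:

```ruby
# In ApplicationWorker: new workers get Sidekiq's full retry cycle
# (25 retries, up to ~20 days) by default.
module ApplicationWorker
  extend ActiveSupport::Concern

  included do
    include Sidekiq::Worker

    sidekiq_options retry: 25
  end
end

# Hypothetical existing worker, updated to keep its current behaviour
# by setting the retry count explicitly.
class SomeExistingWorker
  include ApplicationWorker

  sidekiq_options retry: 3

  def perform(*args)
    # ...
  end
end
```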
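And a rough sketch of what the RuboCop check could look like; the cop name, message, and node pattern below are illustrative, not an existing cop. Existing workers would then be excluded by path via the cop's Exclude list in .rubocop.yml:

```ruby
require 'rubocop'

module RuboCop
  module Cop
    # Hypothetical cop: forbids an explicit `retry: 3` in
    # `sidekiq_options` so that new workers rely on the
    # ApplicationWorker default instead of a copied-over value.
    class SidekiqOptionsRetry < Cop
      MSG = 'Do not set `retry: 3` explicitly; rely on the ' \
            'ApplicationWorker default instead.'

      def_node_matcher :explicit_retry_3?, <<~PATTERN
        (send nil? :sidekiq_options (hash <(pair (sym :retry) (int 3)) ...>))
      PATTERN

      def on_send(node)
        add_offense(node, location: :expression) if explicit_retry_3?(node)
      end
    end
  end
end
```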
With these changes in place, we can rely on Sidekiq's job retry mechanism to handle this for us.