
Set Sidekiq default retries back to 25 (only for new workers) [RUN AS-IF-FOSS]

Sean McGivern requested to merge default-new-workers-to-25-retries into master

Summary

In gitlab-foss!7294 (merged), we set the default retries for Sidekiq to 3. I think in hindsight that was a bit hasty: Sidekiq has picked 25 as a sensible default, and 3 is very low. It means that when, for instance, our database goes down for 5 minutes, we will lose all running jobs (3 retries take a couple of minutes: https://github.com/mperham/sidekiq/wiki/Error-Handling#automatic-job-retry).

I think - again, in hindsight - a better option would be to keep the default at 25, but explicitly set jobs to 3. That's what this MR does. To avoid breaking existing behaviour, it explicitly sets all existing Sidekiq workers to use 3 retries (if they didn't specify already), and adds specs for that.
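As a sketch of the pattern - with a minimal stand-in for `sidekiq_options`, since the real one comes from Sidekiq's class-level DSL via our `ApplicationWorker` concern - an existing worker gets pinned to 3 retries explicitly, while a new worker omits the option and falls through to the global default of 25 (the worker class names here are illustrative):

```ruby
# Minimal stand-in for ApplicationWorker, just enough to show the pattern.
# In the real codebase, sidekiq_options comes from Sidekiq itself.
module ApplicationWorker
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def sidekiq_options(opts = {})
      @sidekiq_options = (@sidekiq_options || {}).merge(opts)
    end

    def get_sidekiq_options
      @sidekiq_options || {}
    end
  end
end

# An existing worker, explicitly pinned to the old default of 3 retries:
class LegacyWorker
  include ApplicationWorker

  sidekiq_options retry: 3
end

# A new worker: no explicit option, so it inherits the default of 25.
class NewWorker
  include ApplicationWorker
end
```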

New workers

This will also let us push jobs for a new worker onto a running queue and let the retry mechanism handle them. Currently, when we add a new worker, we also add a new queue. When canary is deployed to the web, API, and git services, we can schedule new jobs for that worker. They won't run until the Sidekiq service - in the main stage - is deployed and starts listening to the new queue. This is already a little surprising!

In gitlab-com/gl-infra&447 (closed), we want to reduce the number of queues we listen to, so that we no longer have a single queue per worker. That will break the above 'solution': jobs for a new worker scheduled by canary that doesn't exist in main will go to an existing queue, be picked up, and fail. With 3 retries, they would fail permanently very quickly. With 25 retries, we can just let the retry mechanism handle that, which seems better aligned with upstream.

Issue links

For gitlab-com/gl-infra/scalability#996 (closed). Also related to gitlab-com/gl-infra/scalability#986.

Commit messages

Some time ago, we set the default number of Sidekiq retries to 3. In the next commit we're going to change that to 25, but we only want it to apply to new workers. So we'll first need to explicitly set all existing workers to 3.

To aid with this, I used the below script. I had to make a couple of tweaks for some worker concerns, but it covered most of the several hundred workers we have:

# For every worker that includes ApplicationWorker but doesn't already set
# an explicit retry count, insert `sidekiq_options retry: 3` after the include.
Dir.glob('{ee/,}app/workers/**/*.rb').each do |worker|
  contents = File.read(worker)

  next unless contents.include?('ApplicationWorker')
  next if contents.include?('sidekiq_options retry: ')

  new_contents =
    contents.gsub(/^( +)include ApplicationWorker$/,
                  "\\1include ApplicationWorker\n\n" +
                  "\\1sidekiq_options retry: 3")

  File.write(worker, new_contents)
end

A few years ago we set the default number of retries to 3 globally. The reasoning was that we have a lot of jobs that try to communicate with external services, and that those jobs don't need 25 retries - if the service is down, we shouldn't waste our time retrying the other 22 times.

I think that this is valid, but it has some problems as a default:

  1. Most of our workers don't connect to an external service.
  2. 3 retries will happen over a couple of minutes. If - say - our database goes down for 5 minutes, all jobs that we tried during that period will fail completely. If we allow 25 retries then we have a few weeks to fix that issue, which should be sufficient.
  3. We're rolling out a change to use fewer Sidekiq queues. Because we have a mixed-stage deployment, we can have jobs scheduled from canary using new worker classes that don't exist on the main stage (which runs Sidekiq) yet. Those jobs would just fail immediately. Setting a higher number of retries lets those jobs succeed once the main stage is deployed - albeit with a delay due to the back-off.
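To make point 2 concrete: Sidekiq's documented default retry delay is `(count ** 4) + 15 + (rand(30) * (count + 1))` seconds before the `count`-th retry. Ignoring the random jitter term, a quick sketch of the total retry window for 3 versus 25 retries:

```ruby
# Approximate total retry window in seconds, ignoring the rand(30) jitter
# term in Sidekiq's default backoff of (count ** 4) + 15 + (rand(30) * (count + 1)).
def total_retry_window(retries)
  (0...retries).sum { |count| (count ** 4) + 15 }
end

total_retry_window(3)            # => 62 seconds: about a minute
total_retry_window(25) / 86_400  # => 20 days
```

So with 3 retries the whole window is roughly a minute (plus jitter), while 25 retries spread out over about 20 days - which is where the "few weeks to fix the issue" above comes from.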

The meta-reasoning here is that 25 is the default in Sidekiq for a reason, and in retrospect I think we were too hasty in changing it. Sidekiq is very heavily used software, and its default settings have come out of a lot of hard-won experience. Let's try to use more of that!

Edited by Sean McGivern
