# Geo: Do better at ensuring important jobs are run, e.g. after transient problems
From https://gitlab.com/gitlab-org/gitlab-ee/issues/5876#note_109338961:

> Retries are limited in number, and a job may exhaust them and disappear before the other event finishes.
> Should we increase the number of job retries? We changed most jobs from the Sidekiq default to 3 retries, after which they move to the dead queue (capped at 10k jobs) and then disappear.
## Problem
Certain jobs that replicate state to the secondary are too important to lose if they might still succeed an hour or so later. These include at least repo renames, deletes, and migrations to hashed storage. When such a job is lost, we are left with bad data on the secondary.
## Not a problem
We don't have to worry about repo syncs, because the whole registry system is designed around managing sync state. It will resync eventually.
## Background
We can't truly "guarantee" that a job will run, unless we are willing to keep it around indefinitely, which we are not.
But we can at least allow retries for long enough that the job will only disappear if it probably won't ever run.
Currently, 3 retries puts the last try at roughly 2m 30s after the initial failure.
Sidekiq's default error handling: https://github.com/mperham/sidekiq/wiki/Error-Handling
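To sanity-check the retry timings mentioned in this issue, we can sum Sidekiq's documented backoff formula, `(count ** 4) + 15 + (rand(30) * (count + 1))` seconds, over the retry counts. A minimal sketch (the helper names are mine, and the random jitter is replaced by its mean of ~15s to get a deterministic estimate):

```ruby
# Approximate time from first failure to the last retry, using Sidekiq's
# documented backoff: (count ** 4) + 15 + (rand(30) * (count + 1)) seconds.
# rand(30) is replaced by its mean (~15) for a deterministic estimate.
def time_to_last_retry(retries)
  (0...retries).sum { |count| (count ** 4) + 15 + (15 * (count + 1)) }
end

def humanize(seconds)
  days, rem = seconds.divmod(86_400)
  hours, rem = rem.divmod(3600)
  minutes, = rem.divmod(60)
  "#{days}d #{hours}h #{minutes}m"
end

[3, 8, 14, 25].each do |retries|
  puts "#{retries} retries -> last try ~#{humanize(time_to_last_retry(retries))}"
end
```

This lines up with the figures used below: 3 retries lands around 2.5 minutes, 8 around 1.5 hours, 14 around a day, and 25 around 20 days.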
## Initial proposed solution
Allow most Geo jobs at least 8 retries, which puts the last try around 1h 28m after the initial failure. But I would prefer 14 retries (~1 day).
Investigate each worker. If losing a job means bad data somewhere, then consider increasing up to the Sidekiq default of 25 retries (~20 days). Alternatively, increase the dead jobs queue size so we can rerun crucial jobs after deploying a fix.
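If we go the dead-queue route, the dead set's size and retention are configurable. A sketch of what that could look like (option names per the Sidekiq docs, where `dead_max_jobs` defaults to 10,000 and `dead_timeout_in_seconds` to 6 months; the exact setter API varies by Sidekiq version, and this is not the actual GitLab config):

```ruby
# config/initializers/sidekiq.rb -- illustrative only
Sidekiq.configure_server do |config|
  config[:dead_max_jobs] = 100_000                # default: 10_000
  config[:dead_timeout_in_seconds] = 180 * 86_400 # keep dead jobs ~180 days
end
```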
---
For every worker in GitLab CE and EE, follow this decision tree:
```mermaid
graph TD
A(If this job is lost, is data lost?) -->|Yes| B(If this job is retried up to 25 times over 20 days, is there an idempotence or performance concern?)
A -->|No| C(Consider using the default 3 retries)
B -->|Yes| D(Open issue to resolve blocker to increasing retries)
B -->|No| E(Does it make sense to retry for 20 days?)
E -->|Yes| F(Consider setting retries to 25)
E -->|No| G(Open an issue?)
```
## Latest proposed solution

The default number of retries was changed back to 25 at some point. So I think we mostly just need to go through the critical jobs and remove any explicit "3 retries" options that were set at the time of that change (to maintain the old behavior).
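In code terms, the cleanup would look something like this (the worker class and its arguments are hypothetical, not an actual GitLab class):

```ruby
# Hypothetical Geo worker -- illustrative only.
class Geo::RepositoryRenameWorker
  include Sidekiq::Worker

  # Before: explicitly pinned at the old default when it was lowered.
  #   sidekiq_options retry: 3
  #
  # After: delete the override so the worker inherits the restored
  # default of 25 retries (~20 days of backoff).

  def perform(project_id, old_path, new_path)
    # ...rename the repository on the secondary...
  end
end
```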