Some Sidekiq jobs will restart hundreds of times on GitLab.com
For example, https://log.gitlab.net/goto/7e69c3b256bd0e8c3eb37c1b4a31abb5 (ELK has 7-day retention, so these results may be lost shortly)
- A single API project export request was made on the 11th of July to /api/v4/projects/:id/export, correlation_id ZWDFZlTOl7a
- Since then, over the past 5 days, there have been 412 attempts to execute this job
- The job appears to restart at roughly 20-minute intervals. No other details of the job are available.
- The job has generated 250k Gitaly requests to date, and it is probably also generating significant load on Postgres
- The retry counter is not decreasing on each attempt
Corrective actions / Mechanical Sympathy
- Presumably the job is failing. Why are we not seeing logs?
- Catastrophic failures (in which the worker process is killed) should be handled correctly: that is, the retry counter should still decrease. This could be done by decrementing the retry budget before attempting to run the job.
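The "decrement before running" idea could be sketched as a Sidekiq-style server middleware. This is an illustrative sketch only: the class name and the `attempt_count` / `max_attempts` keys are assumptions, not real Sidekiq job fields, and the crucial step of persisting the incremented count back to Redis before the job runs is only noted in a comment.

```ruby
# Hypothetical server middleware: count the attempt *before* the job
# runs, so a catastrophic failure (worker SIGKILLed mid-job) still
# consumes one retry when the job is re-enqueued.
class PreemptiveRetryCounter
  def call(_worker, job, _queue)
    job['attempt_count'] = job.fetch('attempt_count', 0) + 1
    if job['attempt_count'] > job.fetch('max_attempts', 25)
      # Real code would move the payload to the dead set instead.
      raise 'retry budget exhausted'
    end
    # In a real middleware the incremented count would be written back
    # to Redis here, before yielding, so a kill cannot lose it.
    yield
  end
end
```

The point of the sketch is the ordering: the attempt is recorded before `yield`, whereas Sidekiq's stock retry handling only updates the count after a failure is rescued, which a hard kill bypasses.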
- Should we place a maximum time limit on all Sidekiq jobs? In other words: any job that was enqueued more than XX hours ago (24? 36?) is automatically declined and sent to a dead-letter queue.
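The age cutoff above could also be sketched as a server middleware. The class name and 24-hour value are assumptions; `enqueued_at` (epoch seconds) is the field Sidekiq sets in the job payload at push time.

```ruby
# Hypothetical middleware: decline any job enqueued more than
# MAX_AGE_SECONDS ago instead of running it.
class StaleJobFilter
  MAX_AGE_SECONDS = 24 * 60 * 60 # 24h; the actual limit is undecided

  def call(_worker, job, _queue)
    enqueued_at = job.fetch('enqueued_at', Time.now.to_f)
    if Time.now.to_f - enqueued_at > MAX_AGE_SECONDS
      # Real code would push the payload to a dead-letter queue /
      # the Sidekiq dead set here instead of silently skipping it.
      return :dropped
    end
    yield
    :ran
  end
end
```

One design question this surfaces: a blanket cutoff punishes legitimately slow queues, so the limit might need to be per-worker rather than global.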