Prevent concurrent executions of Ci::ArchiveTracesCronWorker jobs
This follows from gitlab-com/gl-infra/production#8209 (closed), where three concurrent executions of the worker caused lock contention and failed to archive job traces.
During this period, the logs (https://log.gprd.gitlab.net/goto/d3a51c90-8d8a-11ed-85ed-e7557b0a598c) show the following error messages:
- Failed to archive trace. message: Failed to obtain a lock. (7111 occurrences)
- The job can not be archived right now. (4231)
- The job does not have live trace but going to be archived. (4034)
- others
We should work on ArchiveTracesCronWorker to reduce the chances of multiple instances running at the same time, like we do for ExpireArtifactsWorker, mainly:
- use deduplicate :until_executed, including_scheduled: true to attempt to prevent Sidekiq from scheduling multiple jobs
- run the business logic with an exclusive lock and timeout, as an additional layer of protection, same as the expire artifacts one.
The worker executes every hour at minute 17, so the lock timeout should be less than 1 hour.
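As a rough illustration of the second point, here is a minimal, self-contained sketch of the "exclusive lock with timeout" idea. It does not use GitLab's real lease helpers: InMemoryLease, ArchiveTracesSketch, LOCK_KEY and the 55-minute LOCK_TIMEOUT are all hypothetical names and values chosen for the example. The only constraint taken from this issue is that the timeout stays under the one-hour schedule. A real implementation would presumably rely on a Redis-backed lease shared across Sidekiq processes rather than an in-process hash.

```ruby
# Hypothetical in-memory stand-in for a shared exclusive lease.
# It only illustrates the semantics: one holder per key, with an expiry.
class InMemoryLease
  def initialize
    @expires_at = {}
    @mutex = Mutex.new
  end

  # Returns true if the lease was obtained, false if it is still held.
  def try_obtain(key, ttl:, now: Time.now)
    @mutex.synchronize do
      expiry = @expires_at[key]
      return false if expiry && expiry > now

      @expires_at[key] = now + ttl
      true
    end
  end

  def release(key)
    @mutex.synchronize { @expires_at.delete(key) }
  end
end

# Hypothetical worker wrapper: runs the business logic only under the lease.
class ArchiveTracesSketch
  LOCK_KEY = 'archive_traces_cron_worker'
  LOCK_TIMEOUT = 55 * 60 # seconds; assumed value, kept below the hourly schedule

  def initialize(lease)
    @lease = lease
  end

  # Returns :executed if the block ran, :skipped if another run holds the lease.
  def perform(now: Time.now)
    return :skipped unless @lease.try_obtain(LOCK_KEY, ttl: LOCK_TIMEOUT, now: now)

    begin
      yield
      :executed
    ensure
      # Release early so the next scheduled run is not blocked until expiry.
      @lease.release(LOCK_KEY)
    end
  end
end
```

With this shape, a second execution that starts while the first is still running returns :skipped immediately instead of contending for per-job locks. If a run crashes without releasing, the lease still expires before the next hourly execution.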
Edited by Marius Bobin