Reduce the noise generated by FailedToObtainLockError on chunk migration
Description
Whenever we migrate a build trace chunk, we schedule a worker and perform the migration under an exclusive Redis lock. However, we generate a lot of redundant workers, especially when object storage is slow.
The worker itself is described as `idempotent!`, but evidence in this Sentry issue -> https://sentry.gitlab.net/gitlab/gitlabcom/issues/1895217 suggests that redundant workers are not de-duplicated. Perhaps they start in sequence and are not de-duplicated when a previous instance is already running.
This error does not indicate that something is broken: we know that everything works fine because chunks are getting migrated. We can, however, think about reducing the workload and the noise generated by this exception.
Proposal
- Increase the initial Runner backoff from 1 second to 2 or 3 seconds, which should significantly reduce the number of redundant workers
- Catch the exception and log it without sending to Sentry
- Consider adding a metric that would help us understand how often this happens
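
The second and third items could look roughly like the sketch below. It is a minimal, standalone illustration and not the actual worker code: the `FailedToObtainLockError` class is redefined locally so the snippet runs on its own, and `LockMissCounter` and `migrate_quietly` are hypothetical names standing in for whatever metrics client and worker method we would really use.

```ruby
require 'logger'

# Local stand-in for the real lock error, defined here only so the sketch is runnable.
class FailedToObtainLockError < StandardError; end

# Hypothetical in-memory counter; a real implementation would use a metrics client.
class LockMissCounter
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment
    @count += 1
  end
end

# Hypothetical wrapper: runs the migration block and returns its result.
# If the lock cannot be obtained, it counts and logs the miss instead of
# letting the exception propagate (and reach Sentry), then returns :skipped.
def migrate_quietly(logger:, counter:)
  yield
rescue FailedToObtainLockError => e
  counter.increment
  logger.info("chunk migration skipped, lock not obtained: #{e.message}")
  :skipped
end
```

Only the lock error is rescued here, so any other failure during migration would still be raised and reported as before.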
/cc @smcgivern