Draft: Shorten MergeTrains::RefreshWorker life span and add a worker to regularly fix stuck Trains

drew stachon requested to merge remove-deduplicate-mt-refresh-worker into master

What does this MR do and why?

This MR changes the locking mechanism that prevents concurrent executions of MergeTrains::RefreshService on the same merge train.

Instead of using deduplicate with the standard 6-hour lock TTL, we use a SleepingLock from ExclusiveLeaseHelpers with a 4-minute TTL. RefreshService itself now stops refreshing cars on the merge train after three minutes and returns an error, which causes the worker to immediately re-queue another job.
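To make the time-boxing concrete, here is a minimal, self-contained sketch of the idea, not the actual RefreshService code. `TimeBoxedRefresh`, `Result`, `run_until_done`, and the injectable clock are hypothetical names introduced for illustration, and the exclusive lease is omitted entirely. The sketch also assumes at least one car is refreshed per run, so a re-queued job always makes progress.

```ruby
# Illustrative sketch only: not GitLab's RefreshService.
# All names here (TimeBoxedRefresh, Result, run_until_done) are hypothetical.
class TimeBoxedRefresh
  Result = Struct.new(:status, :refreshed, :remaining)

  # budget: seconds a single run may spend (three minutes in this MR).
  # clock:  callable returning monotonic seconds; injectable for testing.
  def initialize(budget: 180, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @budget = budget
    @clock = clock
  end

  # Refreshes cars in order until the list is exhausted or the budget is spent.
  # The caller-supplied block does the actual per-car work.
  def execute(cars)
    deadline = @clock.call + @budget
    refreshed = []

    cars.each_with_index do |car, index|
      # Assumption: always process the first car so every run makes progress.
      if index.positive? && @clock.call >= deadline
        return Result.new(:error, refreshed, cars.drop(index))
      end

      yield car
      refreshed << car
    end

    Result.new(:success, refreshed, [])
  end
end

# Stand-in for the worker: an :error result is treated as "re-queue another job".
# Returns the number of jobs it took to refresh every car.
def run_until_done(cars, service, &work)
  jobs = 0
  remaining = cars
  until remaining.empty?
    jobs += 1
    remaining = service.execute(remaining, &work).remaining
  end
  jobs
end
```

In this sketch the error result carries the unprocessed cars, so the follow-up job can pick up where the previous one stopped. Whether the real service resumes or restarts from the front of the train is not stated here.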

Effectively, MergeTrains::RefreshWorker will keep running on a given merge train until all the cars have been refreshed, but shortening the lifespan of each job makes us more fault tolerant and allows the RefreshWorker to be triggered more often.

This change, on its own, shouldn't affect the execution of the RefreshService or solve the stuck merge train problem described in Merge request stuck in locked state when gettin... (#389044). But it does allow us to introduce a new worker that runs on a 5-minute interval, detects what we believe to be a stuck merge train, and fires off a new RefreshWorker job to take care of it.
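As a rough illustration of what such a periodic worker could do, the sketch below selects trains that look stuck and records a refresh for each. The detection criteria are not spelled out in this description, so the rule used here (the train still has cars and has not refreshed within a threshold) is purely an assumption, as are the names `Train`, `STUCK_AFTER`, `stuck_trains`, and `enqueue_refreshes`.

```ruby
# Hypothetical sketch: the "stuck" criteria below are assumptions, not this MR's logic.
Train = Struct.new(:id, :car_count, :last_refreshed_at)

STUCK_AFTER = 5 * 60 # seconds; assumed threshold, chosen to match the 5-minute interval

# Assumed rule: a train is stuck when it still has cars and has not refreshed recently.
def stuck_trains(trains, now:)
  trains.select do |train|
    train.car_count.positive? && (now - train.last_refreshed_at) > STUCK_AFTER
  end
end

# Stand-in for enqueuing one RefreshWorker job per stuck train.
# Returns the ids that would have been enqueued.
def enqueue_refreshes(trains, now:, queue: [])
  stuck_trains(trains, now: now).each { |train| queue << train.id }
  queue
end
```

Because the refresh job now holds its lock for at most four minutes, a job enqueued this way is not blocked for hours by a stale lock. That is the property the shorter TTL is meant to provide.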

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
