Improve the way MergeRequestCleanupRefsWorker gets scheduled
Right now, the ScheduleMergeRequestCleanupRefsWorker
runs every minute, and schedules 180 merge requests to be cleaned, assuming that this many merge requests can be dealt with within that minute.
If one of the merge requests wasn't successfully cleaned up, or the job didn't finish, it will be rescheduled on the next run. If there is a set of merge requests that can't be cleaned for whatever reason, they'll take up one of the 180 slots "forever". Furthermore, if the jobs run way longer (as we've seen in gitlab-com/gl-infra/production#3288 (closed)) we could schedule a whole bunch of jobs on top of each other, all hitting the same Gitaly node.
So I think we need better control over how many of these jobs we run simultaneously.
Suggestion: Refactor the MergeRequestCleanupRefsWorker into a LimitedCapacity::Worker. Since MergeRequest::CleanupSchedule is already its own separate table that contains all of the work to be done, we could use a FOR UPDATE SKIP LOCKED
database lock for the worker to find its own work while not getting in the way of other jobs running.
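To make the shape of that refactor concrete, here is a minimal sketch following the contract of GitLab's LimitedCapacity::Worker concern (perform_work, remaining_work_count, max_running_jobs). The ScheduleStore class and the MAX_RUNNING_JOBS value are stand-ins invented for this sketch so it runs on its own; in GitLab the store would be the MergeRequest::CleanupSchedule table and the cap would be tuned against Gitaly capacity:

```ruby
# In-memory stand-in for the merge_request_cleanup_schedules table.
class ScheduleStore
  Row = Struct.new(:merge_request_id, :status)

  def initialize(merge_request_ids)
    @rows = merge_request_ids.map { |id| Row.new(id, :unstarted) }
  end

  # In GitLab this would be a FOR UPDATE SKIP LOCKED query that also
  # flips the row's status inside the same transaction.
  def claim_next
    row = @rows.find { |r| r.status == :unstarted }
    row.status = :running if row
    row
  end

  def finish(row)
    row.status = :completed
  end

  def pending_count
    @rows.count { |r| r.status == :unstarted }
  end
end

class MergeRequestCleanupRefsWorker
  MAX_RUNNING_JOBS = 10 # assumed cap for illustration

  def initialize(store)
    @store = store
  end

  # LimitedCapacity::Worker re-enqueues perform_work while capacity
  # allows and remaining_work_count stays positive.
  def perform_work
    row = @store.claim_next
    return unless row

    # ... delete the merge request's stale refs via Gitaly here ...
    @store.finish(row)
  end

  def remaining_work_count
    @store.pending_count
  end

  def max_running_jobs
    MAX_RUNNING_JOBS
  end
end
```

The point of the contract is that the number of concurrently running jobs is bounded by max_running_jobs regardless of how long each perform_work call takes, which removes the "jobs piling up every minute" failure mode.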
This does mean that we would need to keep track of the status of a schedule on that row, and set that status inside the same transaction. An example implementation of this is the way we select container registry repositories for clean-up: https://gitlab.com/gitlab-org/gitlab/blob/28ae2160d01a69430716f332bfb128b592c26dc8/app/workers/container_expiration_policies/cleanup_container_repository_worker.rb#L80
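The SKIP LOCKED behaviour can be illustrated in plain Ruby with one Mutex per row: a worker tries to lock each candidate row and simply skips rows another worker already holds instead of blocking on them. This is only an analogy (in Postgres it would be a row lock held for the duration of the transaction); the Row struct and claim helper are invented for this sketch:

```ruby
# Each "row" carries its own lock, mimicking a Postgres row lock.
Row = Struct.new(:id, :lock, :status)

# Claim the first unstarted row that no other worker holds.
def claim(rows)
  rows.each do |row|
    next unless row.status == :unstarted
    # try_lock never blocks: if another worker holds the row,
    # we skip it and move on -- the SKIP LOCKED semantics.
    next unless row.lock.try_lock

    row.status = :running
    return row
  end
  nil
end

rows = (1..4).map { |i| Row.new(i, Mutex.new, :unstarted) }

# Two workers each claim work; neither waits on the other's row.
worker_a = claim(rows)
worker_b = claim(rows)
```

Because nobody ever waits on a locked row, a slow or stuck clean-up only occupies its own row, and every other worker keeps finding fresh work.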
That way, we have better control over how many jobs run at the same time, without being surprised when one of the jobs runs longer.
We could also keep track of whether a clean-up has failed too many times, to avoid retrying it indefinitely.
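One way to do that, sketched with assumed names (a failed_count column and a MAX_RETRIES cap, neither of which exists in the source): bump a counter each time a clean-up fails, and once the cap is hit, park the row in a terminal state so it no longer occupies a worker slot:

```ruby
# Stand-in for a MergeRequest::CleanupSchedule row with an assumed
# failed_count column.
Schedule = Struct.new(:merge_request_id, :status, :failed_count)

MAX_RETRIES = 3 # assumed cap for illustration

# On failure: either return the row to the pool for another attempt,
# or mark it permanently failed once it has used up its retries.
def record_failure(schedule)
  schedule.failed_count += 1
  schedule.status =
    schedule.failed_count >= MAX_RETRIES ? :failed : :unstarted
end

# Only unstarted rows are ever picked up again; :failed rows are
# excluded from the claim query and stop wasting slots.
def eligible?(schedule)
  schedule.status == :unstarted
end
```

This directly addresses the "takes up a slot forever" problem above: a merge request that can never be cleaned eventually drops out of the rotation instead of being rescheduled every run.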