Praefect: handling of stale replication jobs
Problem
The replication manager consumes jobs from the queue and updates their state to in_progress, so no other job for the same repository can be picked up for processing until the current job is finished (marked as dead, failed, or completed). This is fine and works as expected.
The problem arises when a job is never moved to one of those states.
Subsequent jobs for this repository are then never picked up, and replication for the repository stops.
This can happen if Praefect is restarted while a replication job is being processed: the job then remains in the in_progress state.
Proposal
As a solution, we can implement a background process that polls the database for stale jobs and changes their state to unblock replication for the repository.
A stale job is a job that has had no activity for the last X sec/min.
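The stale-job check could look roughly like the sketch below. It works on an in-memory slice rather than the real database, and the Job type, its fields, and the staleJobs function are hypothetical names used only for illustration. Which state a stale job should be moved to is left to the caller, since the proposal does not fix it.

```go
package main

import "time"

// Job is a hypothetical, simplified view of a replication job.
// It is not the actual Praefect schema.
type Job struct {
	ID         uint64
	State      string    // e.g. "in_progress", "failed", "completed", "dead"
	LastActive time.Time // last time the job reported activity (assumed field)
}

// staleJobs returns the IDs of jobs that are still in_progress but have
// reported no activity for longer than timeout, measured against now.
func staleJobs(jobs []Job, now time.Time, timeout time.Duration) []uint64 {
	var ids []uint64
	for _, j := range jobs {
		if j.State == "in_progress" && now.Sub(j.LastActive) > timeout {
			ids = append(ids, j.ID)
		}
	}
	return ids
}
```

A background process would call something like staleJobs on a timer and then change the state of the returned jobs so that the next job for the repository can be picked up.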
To record a job's activity, we could start a goroutine that periodically updates a column of the replication_queue_job_lock table until replication is finished (once replication is finished, this record is removed automatically by the current implementation).
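A minimal sketch of such a heartbeat goroutine follows. The startHeartbeat function and its touch callback are hypothetical: touch stands in for whatever statement would update the column in replication_queue_job_lock, which is not shown here.

```go
package main

import (
	"sync"
	"time"
)

// startHeartbeat runs touch every interval in a separate goroutine.
// touch is a placeholder for the database update (an assumption of this sketch).
// The returned stop function ends the heartbeat and waits for the goroutine
// to exit; it is safe to call more than once.
func startHeartbeat(interval time.Duration, touch func()) (stop func()) {
	quit := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-quit:
				return
			case <-ticker.C:
				touch()
			}
		}
	}()

	var once sync.Once
	return func() {
		once.Do(func() { close(quit) })
		<-done
	}
}
```

The replication code would call startHeartbeat before processing a job and call stop once the job reaches a final state. If Praefect is restarted mid-job, the heartbeat simply stops, and the job becomes stale after the chosen timeout.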
/cc @zj-gitlab @jramsay