Auto-recover stuck Merge Train
We recently had a problem on www-gitlab-com where an MR got stuck in the merge train and prevented the following MRs from being merged. The cause was that AutoMergeProcessWorker did not run for the problematic MR at that time for some reason (production incident, race condition, etc.). The problem was mitigated with the workaround described below.
Since Sidekiq (the background job processor) might not be able to finish the job during a production incident, we should have a periodic worker to recover stuck trains.
- GitLab periodically checks the status of the first MR on each merge train. If that MR is considered stuck, GitLab re-processes it.
- We have a cron worker, StuckCiJobsWorker, to drop stuck pipeline jobs. We can introduce a similar worker for the merge train domain.
- The system can get the first MR on every merge train with the following query:
SELECT DISTINCT ON (project_id) merge_trains.merge_request_id FROM merge_trains WHERE status IN ('idle', 'stale', 'fresh') ORDER BY project_id, id
- An MR is considered stuck when 1) the merge train pipeline has finished (success, failed, etc.), 2) the pipeline finished more than 10 minutes ago, and 3) the MR is still on the train.
- For a successful pipeline, this can be translated into:
merge_request.actual_head_pipeline.success? && merge_request.actual_head_pipeline.finished_at < 10.minutes.ago
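The selection and the stuck check above can be sketched in plain Ruby, without Rails, to make the logic easy to test in isolation. This is only an illustration under assumptions: `Pipeline`, `TrainCar`, `first_cars`, `stuck?`, and `stuck_merge_request_ids` are hypothetical names, the list of completed statuses is an assumed subset, and the input is assumed to contain only MRs that are still on a train (condition 3).

```ruby
# Plain-Ruby stand-ins for the Rails models; every name here is hypothetical.
COMPLETED_STATUSES = %w[success failed canceled skipped].freeze # assumed subset
STUCK_THRESHOLD_SECONDS = 10 * 60 # "finished 10 minutes ago"

Pipeline = Struct.new(:status, :finished_at, keyword_init: true)
TrainCar = Struct.new(:id, :project_id, :merge_request_id, :pipeline, keyword_init: true)

# Emulates SELECT DISTINCT ON (project_id) ... ORDER BY project_id, id:
# keep the entry with the lowest id for each project.
def first_cars(cars)
  cars.group_by(&:project_id).map { |_project_id, group| group.min_by(&:id) }
end

# Conditions 1 and 2: the pipeline has finished, and it finished more than
# the threshold ago. Condition 3 (still on the train) is assumed to hold
# for every car passed in.
def stuck?(car, now: Time.now)
  pipeline = car.pipeline
  return false if pipeline.nil? || pipeline.finished_at.nil?

  COMPLETED_STATUSES.include?(pipeline.status) &&
    pipeline.finished_at < now - STUCK_THRESHOLD_SECONDS
end

# IDs of the MRs that sit first on their train and look stuck.
def stuck_merge_request_ids(cars, now: Time.now)
  first_cars(cars).select { |car| stuck?(car, now: now) }.map(&:merge_request_id)
end
```

Unlike the one-line translation above, this sketch treats any completed status as finished, which matches the written definition (success, failed, etc.).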
Here is a quick workaround until we deliver the auto-recovery functionality.
- Log in to the Rails console
> project = Project.find_by_full_path('gitlab-com/www-gitlab-com') # Get the target project
> AutoMergeProcessWorker.perform_async(project.merge_trains.active.first.merge_request.id) # Re-run the AutoMerge process job
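A periodic worker could apply the same re-run step to every stuck MR rather than to one project by hand. The following is a minimal sketch, not the actual implementation: `StuckMergeTrainRecovery` is a hypothetical class, and the `enqueue` callable is an assumption that stands in for `AutoMergeProcessWorker.perform_async` so the example runs without Rails or Sidekiq.

```ruby
# Hypothetical recovery step: re-enqueue the auto-merge job for each stuck MR.
class StuckMergeTrainRecovery
  # enqueue: any callable that takes a merge request id. In the Rails app this
  # could be ->(id) { AutoMergeProcessWorker.perform_async(id) } (assumption).
  def initialize(enqueue:)
    @enqueue = enqueue
  end

  # Returns the ids that were re-enqueued successfully. A failure for one MR
  # is logged and skipped so that it does not block recovery of other trains.
  def perform(merge_request_ids)
    merge_request_ids.uniq.select do |id|
      begin
        @enqueue.call(id)
        true
      rescue StandardError => e
        warn "could not re-enqueue merge request #{id}: #{e.message}"
        false
      end
    end
  end
end
```

Injecting the enqueue step keeps the sketch independent of Sidekiq. Rescuing per MR reflects the motivation above: a job that fails during an incident should be retried on the next periodic run instead of stopping the whole pass.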