Auto-recover stuck Merge Train
Problem
We recently had a problem on www-gitlab-com where an MR was stuck/clogged in the merge train and prevented the following MRs from being merged. The cause was that `AutoMergeProcessWorker` did not run for the problematic MR at the time, for some reason (production incident, race condition, etc.). The problem was mitigated by the workaround described below.
Since Sidekiq (the background process) might not be able to finish the job when a production incident happens, we should have a periodic worker to recover stuck trains.
Proposal
- GitLab checks the status of the first MR on each merge train repeatedly, and if it is considered stuck, re-processes the problematic MR on the train.
- We have a cron worker, `StuckCiJobsWorker`, to drop stuck pipeline jobs. We can introduce a similar worker for the merge train domain (see the sketch after this list).
- The system can get the first MRs on all merge trains with `SELECT DISTINCT ON (project_id) merge_trains.merge_request_id FROM merge_trains WHERE status IN ('idle', 'stale', 'fresh') ORDER BY project_id, id`.
- The status `stuck` is defined as: 1) the merge train pipeline finished (success, failed, etc.), 2) the pipeline finished more than 10 minutes ago, and 3) the MR is still on the train.
- This can be translated into `MergeTrain#stuck?` => `merge_request.actual_head_pipeline.success? && merge_request.actual_head_pipeline.finished_at < 10.minutes.ago`.
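
As a rough illustration of the proposal, here is a minimal sketch of such a cron worker. The class name `StuckMergeTrainsWorker`, the cron wiring, and the scopes used are assumptions for illustration only (not existing GitLab code); the stuck check mirrors the `MergeTrain#stuck?` definition above:

```ruby
# Hypothetical cron worker (illustrative only, not existing GitLab code).
# It walks the first car of every active merge train and re-enqueues
# AutoMergeProcessWorker when the train looks stuck.
class StuckMergeTrainsWorker # hypothetical name
  include ApplicationWorker
  include CronjobQueue

  STUCK_TIMEOUT = 10.minutes

  def perform
    # Assumed to produce the same rows as the SQL in the proposal:
    # SELECT DISTINCT ON (project_id) ... WHERE status IN ('idle', 'stale', 'fresh') ORDER BY project_id, id
    first_cars = MergeTrain.active
      .select('DISTINCT ON (project_id) merge_trains.*')
      .order(:project_id, :id)

    first_cars.each do |train|
      next unless stuck?(train)

      # Same recovery as the manual/legacy workaround: re-run the auto-merge process.
      AutoMergeProcessWorker.perform_async(train.merge_request_id)
    end
  end

  private

  # Mirrors the MergeTrain#stuck? definition from the proposal: the head pipeline
  # finished more than 10 minutes ago while the MR is still on the train.
  def stuck?(train)
    pipeline = train.merge_request.actual_head_pipeline

    pipeline&.success? && pipeline.finished_at < STUCK_TIMEOUT.ago
  end
end
```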
When the merge train in the www-gitlab-com project might be stuck
Workaround
Here is the quick workaround, until we deliver the auto-recovery functionality.
- Access the Merge Train API in the browser: https://gitlab.com/api/v4/projects/7764/merge_trains?scope=active&sort=asc&per_page=100. You can search for and count the number of occurrences of `iid` to get an idea of how many MRs are currently in the train (up to the `per_page` parameter max limit of `100`).
- Check the first `iid` in the payload. For example, if the response starts with `[{"id":98945,"merge_request":{"id":67833666,"iid":59834,"project_id":7764,"title":"Add YouTube playlists for Distribution tea...`, then `59834` is the first MR (likely clogged) in the train. Alternatively, look for the first instance of `web_url` that contains the MR URL. (A small script that automates this check is sketched after this list.)
- Click the "Remove from merge train" button and re-add the MR. Note that the Maintainer role is required to see the button.
- Refresh the API page in the browser, and verify that the MR is no longer the first one in the queue. If it is, see "Gotchas" below.
- Continue to monitor the number of MRs in the train until the backlog is worked off. Ensure that the first one is always currently running some pipeline with actively running jobs, and not stuck. If it is stuck, repeat these steps.
- See also the issue for looking into the backend Gitaly performance problems related to this, specifically the links to APDEX charts.
- See also the issue for "Stuck Merge Train because of GitError".
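
If you prefer to inspect the train from a terminal rather than the browser, here is a small standalone Ruby sketch that fetches the same endpoint and prints the queue depth and the first MR. It assumes a personal access token with `read_api` scope in the `GITLAB_TOKEN` environment variable, and that `web_url` is nested under `merge_request` as in the example payload above:

```ruby
# Sketch: fetch the active merge train for www-gitlab-com (project 7764)
# and print how deep the queue is and which MR is at the front.
require 'net/http'
require 'json'
require 'uri'

uri = URI('https://gitlab.com/api/v4/projects/7764/merge_trains?scope=active&sort=asc&per_page=100')
request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
cars = JSON.parse(response.body)

puts "MRs currently on the train: #{cars.size} (capped at per_page=100)"

if (first = cars.first)
  mr = first['merge_request']
  puts "First (possibly clogged) MR: !#{mr['iid']} - #{mr['title']}"
  puts mr['web_url']
end
```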
Gotchas
The above doesn't always work. Sometimes, the MR in question will just go back to the beginning of the queue. In that case:
- Remove the MR in question from the train.
- Manually re-run its pipeline.
- Click the button to re-add it to the train when the pipeline passes.
Also, in some cases the first MR in the queue may still be actively running jobs, contrary to the instructions above to "Ensure that the first one is always currently running some pipeline with actively running jobs". In the one known instance where this happened, it eventually cleared itself up, and appeared to be related to this issue: #255281 (closed)
Legacy Workaround
This requires Rails console access.
1. Log in to the Rails console.
1. Re-enqueue the auto-merge job for the first MR on the train:

   ```ruby
   project = Project.find_by_full_path('gitlab-com/www-gitlab-com') # Get a target project
   AutoMergeProcessWorker.perform_async(project.merge_trains.active.first.merge_request.id) # Re-run the AutoMerge Process Job
   ```
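
Before re-enqueuing, you can sanity-check that the first car actually matches the `stuck` definition from the proposal. A small console sketch using the same associations as the commands above (assuming `actual_head_pipeline` is the merge train pipeline of that MR):

```ruby
train = project.merge_trains.active.first            # First car on the train
pipeline = train.merge_request.actual_head_pipeline  # Head pipeline for that MR

pipeline.status       # A finished status (e.g. "success") even though the MR is still on the train
pipeline.finished_at  # Looks stuck if this is more than ~10 minutes ago
```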