GitLab.org / GitLab · Issue #217908 (Closed)
Created May 15, 2020 by Shinya Maeda (@shinya.maeda), Maintainer

Auto-recover stuck Merge Train

Problem

We recently had a problem on www-gitlab-com where an MR was stuck/clogged in the merge train, preventing the following MRs from being merged. The cause was that AutoMergeProcessWorker did not run for the problematic MR at the time, for some reason (production incident, race condition, etc.). The problem was mitigated by the workaround described below.

Since Sidekiq (the background processor) might not be able to finish the job when a production incident happens, we should have a periodic worker that recovers stuck trains.

Proposal

  • GitLab repeatedly checks the status of the first MR in each merge train, and if it's considered stuck, re-processes the problematic MR on the train.
  • We have a cron worker, StuckCiJobsWorker, that drops stuck pipeline jobs. We can introduce a similar worker for the merge train domain (see the sketch after this list).
  • The system can get the first MR on every merge train with SELECT DISTINCT ON (project_id) merge_trains.merge_request_id FROM merge_trains WHERE status IN ('idle', 'stale', 'fresh') ORDER BY project_id, id
  • The status stuck is defined as: 1) the merge train pipeline has finished (success, failed, etc.), 2) the pipeline finished more than 10 minutes ago, and 3) the MR is still on the train.
  • This can be translated into MergeTrain#stuck? => merge_request.actual_head_pipeline.complete? && merge_request.actual_head_pipeline.finished_at < 10.minutes.ago (complete? rather than success?, so that failed pipelines also count as finished, matching definition 1 above)
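
What follows is a minimal sketch of what such a periodic worker could look like. StuckMergeTrainsWorker is a hypothetical name mirroring StuckCiJobsWorker; the active scope, the merge_request_id column, and AutoMergeProcessWorker come from this issue, while the rest is illustrative rather than a final implementation:

class StuckMergeTrainsWorker            # hypothetical name
  include ApplicationWorker             # assumed cron-worker boilerplate

  def perform
    # First car of every active train, same idea as the
    # SELECT DISTINCT ON (project_id) query above.
    first_cars = MergeTrain.active
                           .select('DISTINCT ON (project_id) merge_trains.*')
                           .order(:project_id, :id)

    first_cars.each do |train|
      next unless train.stuck?          # the predicate defined above

      # Re-process the stuck MR; this is what the legacy workaround does by hand.
      AutoMergeProcessWorker.perform_async(train.merge_request_id)
    end
  end
end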

When the merge train in the www-gitlab-com project might be stuck

Workaround

Here is a quick workaround until we deliver the auto-recovery functionality.

  1. Access the Merge Train API in the browser: https://gitlab.com/api/v4/projects/7764/merge_trains?scope=active&sort=asc&per_page=100. You can search for and count the occurrences of iid to get an idea of how many MRs are currently in the train (up to the per_page maximum of 100). A scripted version of this lookup is sketched after this list.
  2. Check the first iid in the payload. For example, if it starts with [{"id":98945,"merge_request":{"id":67833666,"iid":59834,"project_id":7764,"title":"Add YouTube playlists for Distribution tea..., then 59834 is the first (likely clogged) MR in the train. Alternatively, look for the first instance of web_url that contains the MR URL.
  3. Click the "Remove from merge train" button on that MR and re-add it. Note that the Maintainer role is required to see the button.
  4. Refresh the API page in the browser and verify that the MR is no longer the first one in the queue. If it still is, see "Gotchas" below.
  5. Continue to monitor the number of MRs in the train until the backlog is worked off. Ensure that the first one is always running a pipeline with actively running jobs, and not stuck. If it is stuck, repeat these steps.
  6. See also the issue about the backend Gitaly performance problems related to this, specifically the links to Apdex charts.
  7. See also the issue "Stuck Merge Train because of GitError".
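
For steps 1 and 2, rather than eyeballing the JSON by hand, a small script can fetch the train and print the first MR. Here is a sketch in plain Ruby (standard library only); the GITLAB_TOKEN environment variable is an assumption for authentication, and the URL and payload fields mirror those described above:

require 'json'
require 'net/http'

# Same Merge Train API endpoint as step 1 (www-gitlab-com is project 7764).
uri = URI('https://gitlab.com/api/v4/projects/7764/merge_trains?scope=active&sort=asc&per_page=100')
request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN')   # personal access token (assumed)

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
trains = JSON.parse(response.body)

puts "MRs currently in the train: #{trains.size} (capped by per_page=100)"
if (first = trains.first)
  mr = first['merge_request']
  puts "First (possibly clogged) MR: !#{mr['iid']} #{mr['web_url']}"
end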

Gotchas

The above doesn't always work. Sometimes, the MR in question will just go back to the beginning of the queue. In that case:

  1. Remove the MR in question from the train.
  2. Manually re-run its pipeline (an API alternative is sketched after this list).
  3. Click the button to re-add it to the train when the pipeline passes.
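
Step 2 can also be done without the UI. Here is a sketch using the create-MR-pipeline REST endpoint (POST /projects/:id/merge_requests/:merge_request_iid/pipelines); the iid 59834 is the example from the workaround above, and GITLAB_TOKEN is an assumed personal access token:

require 'net/http'

# Trigger a fresh pipeline for the removed MR before re-adding it to the train.
uri = URI('https://gitlab.com/api/v4/projects/7764/merge_requests/59834/pipelines')
request = Net::HTTP::Post.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts response.code   # 201 means the pipeline was created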

Also, in some cases the first MR in the queue may still be actively running jobs even though the train is not moving, despite the instruction above to "ensure that the first one is always running a pipeline with actively running jobs, and not stuck". In the one known instance where this happened, it eventually cleared itself up, and appeared to be related to this issue: #255281 (closed)

Legacy Workaround

This requires Rails console access.

Legacy workaround 1: Log in to the Rails console and re-enqueue the auto-merge job:
> project = Project.find_by_full_path('gitlab-com/www-gitlab-com')                            # Get the target project
> AutoMergeProcessWorker.perform_async(project.merge_trains.active.first.merge_request.id)    # Re-run the AutoMergeProcessWorker job
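
Before re-enqueueing, it can be worth confirming in the same console session that the first car actually matches the stuck? definition from the proposal. actual_head_pipeline is used earlier in this issue; treating complete? as the "pipeline finished" check is an assumption:

> train = project.merge_trains.active.first                       # First car on the train
> pipeline = train.merge_request.actual_head_pipeline             # Pipeline the train is waiting on
> pipeline.complete? && pipeline.finished_at < 10.minutes.ago     # true => stuck per the proposal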
Edited Oct 03, 2020 by Chad Woolley