Merge Train stuck without pipelines
Description of the event
When a MR was added to a Merge Train (first MR starting the train) the system notes in the MR showed:
Also no pipeline was created for merge train for this MR. Using the API, the MR however still appeared to be on the train but in idle
state:
API response for the given merge train car
{
"id": 1337560,
"merge_request": {
"id": 210871957,
"iid": 121316,
"project_id": 7764,
"title": "update slack channel name guidance",
"description": "<!-- Before proceeding, please check if you need to apply a specific MR description template from the dropdown menu above next to `Description` (e.g. blog post, website bug report). -->\n\n## Why is this change being made?\n\n<!--\nProvide a detailed answer to the question on **why** this change is being proposed, in accordance with our value of [Transparency][transparency].\n\nExample: `We have discussed the topic in Slack - (copy of Slack conversation). The current process is not efficient, this MR makes the description of X more clear, and helps move Y forward.`\n-->\n\n## Author Checklist\n\n<!-- Please verify the check list and ensure to tick them off before the MR is merged. -->\n\n- [ ] Provided a concise title for this [Merge Request (MR)][mr]\n- [ ] Added a description to this MR explaining the reasons for the proposed change, per [**say why, not just what**][say-why-not-just-what]\n - Copy/paste the Slack conversation to document it for later, or upload screenshots. Verify that no confidential data is added, and the content is [SAFE][SAFE]\n- [ ] Assign reviewers for this MR to the correct [Directly Responsible Individual/s (DRI)][dri]\n - If the DRI for the page/s being updated isn’t immediately clear, then assign it to one of the people listed in the `Maintained by` section on the page being edited\n - If your manager does not have merge rights, please ask someone to merge it **AFTER** it has been approved by your manager in [#mr-buddies][mr-buddies-slack]\n - The [when to get approval][when-to-get-approval] handbook section explains the workflow in more detail\n- [ ] If the changes affect team members, or warrant an announcement in another way, please consider posting an update in [#whats-happening-at-gitlab][whats-happening-at-gitlab-slack] linking to this MR\n - If this is a change that directly impacts the majority of global team members, it should be a candidate for [#company-fyi][company-fyi-slack]. Please work with [internal communications][internal-communications] and check the handbook for examples.\n\n<!-- Quick actions for assignment, labels, review requests. Please update them as needed. -->\n\n<!-- Assign yourself -->\n\n<!-- Apply labels: You can keep or remove `handbook` as needed, add other relevant labels, or remove this line. -->\n\n<!-- Assign reviewer(s), following https://about.gitlab.com/handbook/handbook-usage/#when-to-get-approval. Remove the [HTML comment tags](https://www.w3schools.com/tags/tag_comment.asp) to enable. -->\n\n<!--\n-->\n\n---\n\n<!-- DO NOT REMOVE -->\n[transparency]: https://about.gitlab.com/handbook/values/#transparency\n[mr]: https://docs.gitlab.com/ee/user/project/merge_requests/\n[say-why-not-just-what]: https://about.gitlab.com/handbook/values/#say-why-not-just-what\n[dri]: https://about.gitlab.com/handbook/people-group/directly-responsible-individuals/\n[SAFE]: https://about.gitlab.com/handbook/legal/safe-framework/\n[when-to-get-approval]: https://about.gitlab.com/handbook/handbook-usage/#when-to-get-approval\n[internal-communications]: https://about.gitlab.com/handbook/people-group/employment-branding/people-communications/\n[mr-buddies-slack]: https://gitlab.slack.com/archives/CLM8K5LF4\n[company-fyi-slack]: https://gitlab.slack.com/archives/C010XFJFTHN\n[whats-happening-at-gitlab-slack]: https://gitlab.slack.com/archives/C0259241C",
"state": "opened",
"created_at": "2023-03-13T16:12:30.351Z",
"updated_at": "2023-03-13T16:23:02.327Z",
"web_url": "https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/121316"
},
"user": {
"id": 5667426,
"username": "spatching",
"name": "Sherrod Patching",
"state": "active",
"avatar_url": "https://gitlab.com/uploads/-/system/user/avatar/5667426/avatar.png",
"web_url": "https://gitlab.com/spatching"
},
"pipeline": null,
"created_at": "2023-03-13T16:23:01.328Z",
"updated_at": "2023-03-13T16:23:01.328Z",
"target_branch": "master",
"status": "idle",
"merged_at": null,
"duration": null
}
Other MRs were subsequently added to the same merge train which also was stuck.
After re-adding the first MR back to the train it got the train finally starting but since now the whole train had to be refreshed (with 20 pipelines running), the pipelines started failing with error fatal: remote error: GitLab is currently unable to handle this request due to load.
.
Problem
We created 2 system notes when the MR was added to the train, possibly the user clicked twice and we triggered 2 concurrent operations.
When we add the MR to the train we first [build the ActiveRecord model for MergeTrain
](When we add a MR on the train we do the following: https://gitlab.com/gitlab-org/gitlab/-/blob/39b0df75698d67a4a8030682928bf627af4fd76b/ee/app/services/auto_merge/merge_train_service.rb#L6), then persist the merge request together with the new merge train object associated, finally we send out the system note.
If we try to reproduce double addition of the MR to the merge train we can see that we maintain exclusively 1 merge train record by removing any existing one:
mr = MergeRequest.first
mr.build_merge_train(user: User.first, target_project: mr.target_project, target_branch: mr.target_branch)
mr.save! # correctly saves the first time
# Attempting to add the same MR to the train again
mr.build_merge_train(user: User.first, target_project: mr.target_project, target_branch: mr.target_branch)
# fires a query:
# MergeTrain Destroy (0.2ms) DELETE FROM "merge_trains" WHERE "merge_trains"."id" = 3
mr.save! # correctly saves but we previously removed 1 merge_trains record!
# The merge train car 3 disappeared and we didn't do anything to the train the state machine
MergeTrain.pluck(:id) # => [4]
This also explains why we could see 2 system notes plus 1 note about the merge request being removed from the train (due to error) while the merge request was still (again) on the merge train.
It could be that this caused the previous merge train record to silently disappear leaving the whole merge train state machine in an inconsistent state.
Proposal
Fabio's walk-through of build_merge_train
deleting the existing merge train is pretty good. A few changes that it seems like we should make:
- Add a simple guard to
AutoMerge::MergeTrainService#execute
to preventMergeRequest#build_merge_train
from being call is a MergeTrain already exists. We have existingabort
functionality, so this kind of AR-implied record destruction doesn't really mean anything in our product workflow. We should handle "try-to-build-when-already-exists" in a more controlled way. - Understand if there are workflows where we actually want to abort the current merge train and start a new one. Does that exist yet? Should there be a simple-readd operation where we abort and re-build? Or should we just end
execute
if themerge_train
is already there and assume that the first one existing is enough? - Check on whether or not this happens in background jobs, and consider using
Gitlab::ExclusiveLease
to ensure that the same merge request doesn't get operated on by two separate requests in parallel.