Merge trains for merge request pipelines
Problem to solve
#7380 introduces running a build on the result of the merged code prior to merging, as a way to keep master green. However, for teams with a high rate of change in the target branch (typically master), in many or even all cases another commit will have reached master by the time the merged code is validated, invalidating the merged result.
We need some kind of queuing, cancellation or retry mechanism for these scenarios in order to ensure an orderly flow of changes into the target branch.
See also: https://github.com/bors-ng/bors-ng
After discussion (see comment threads), we concluded that a hybrid approach is best. It is defined as follows:
Hybrid Merge Train
This approach forms a merge train, of which there can be only one per project. If the feature branch is green, the MR gets a button labeled either "Merge when target branch succeeds" or "Add as nth MR to merge train". The label depends on whether a merge train has already formed; the latter is the button for joining an existing train.
When "Merge when target branch succeeds" is clicked in the first MR, the pipeline and all jobs start running normally. If the button is subsequently pressed in a different MR, then instead of creating a new pipeline against the target branch alone, it creates a new pipeline targeting the merge result of the previous MR plus the target branch. In this way, if all the pipelines in the train succeed, no pipeline time is wasted on either queuing or retrying.
This approach was selected because it is the most balanced optimization. It works by assuming that, most of the time, a green feature branch will merge fine, and by treating failure as a rarer exception. In situations where failure is more common, this will be less efficient than the other options below. We will consider adding more operation modes for release trains in the future to handle these more niche but important use cases.
How Merge Trains are Constructed
Each MR that joins a merge train joins as the last item in the train. Each item takes the optimistic state in which the previous item succeeds, adds its own changes, and starts building immediately under the assumption (hope) that everything is going to pass. For example, if four MRs are queued together, this is what they would look like:
1. MR1
2. MR1+MR2
3. MR1+MR2+MR3
4. MR1+MR2+MR3+MR4
Given this composition, it's clear that they must not be allowed to merge out of order. Also, if any item fails, the train will need to be rebuilt starting from the first non-failing item. So, for example, if MR2 (pipeline 2) failed, all running pipelines for the merge train will be canceled as invalid, and a new train built containing:
1. MR3
2. MR3+MR4
In this scenario, MR1 will have already merged, so it is no longer in play. MR2 is known to be broken, so it is not added back to the merge train. MR2 could potentially resolve its issues and queue back up at the end of the train, as item 3: MR3+MR4+MR2.
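The construction and rebuild rules above can be sketched in a few lines of Python. This is only an illustration of the ordering logic, not an implementation: the function names `compose_train` and `rebuild_train` are hypothetical, and MRs are represented as plain strings.

```python
def compose_train(queue):
    """Pipeline k builds the target branch plus the first k MRs in the queue."""
    return [tuple(queue[: i + 1]) for i in range(len(queue))]


def rebuild_train(queue, merged, failed):
    """Drop MRs that already merged or failed; keep the rest in their original order."""
    remaining = [mr for mr in queue if mr not in merged and mr not in failed]
    return compose_train(remaining)


queue = ["MR1", "MR2", "MR3", "MR4"]

# Optimistic train: each pipeline assumes every MR ahead of it passes.
train = compose_train(queue)

# MR1 has merged and MR2 failed: all running pipelines are invalid,
# so the train is rebuilt from the first non-failing item.
rebuilt = rebuild_train(queue, merged={"MR1"}, failed={"MR2"})

# MR2 fixes its issues and rejoins as the last item.
requeued = compose_train(["MR3", "MR4", "MR2"])

for name, pipelines in [("train", train), ("rebuilt", rebuilt), ("requeued", requeued)]:
    print(name, ["+".join(p) for p in pipelines])
```

Because each pipeline's contents are a prefix of the queue, merging out of order would invalidate every pipeline behind the skipped item, which is why the order must be preserved.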
Other Strategies Discussed
Abort & Retry
One approach is to abort the current pipeline, update, and retry. For relatively rare cases of contention, this could work: everything will eventually flow into the target branch after a short delay. In cases where changes land constantly, though, this could build up into very long delays.
One at a Time
An alternative is to prevent the above situation in the first place: only allow one merge-result pipeline to run at a time, and queue the rest. This will ensure an orderly series of pipelines merging, but could also result in very long queues in busy repos.
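To make the queueing cost concrete, here is a rough back-of-the-envelope sketch. It rests on simplifying assumptions that are not part of the proposal: all MRs are queued at time zero, every pipeline takes the same number of minutes, nothing fails, and runners are not a bottleneck. The function names are hypothetical.

```python
def one_at_a_time_finish_times(n_mrs, pipeline_minutes):
    """One at a Time: pipeline k starts only after pipeline k-1 has finished."""
    return [(k + 1) * pipeline_minutes for k in range(n_mrs)]


def merge_train_finish_times(n_mrs, pipeline_minutes):
    """Hybrid merge train, all green: every pipeline starts immediately."""
    return [pipeline_minutes] * n_mrs


serial = one_at_a_time_finish_times(4, 30)
train = merge_train_finish_times(4, 30)
print("one at a time:", serial)
print("merge train:  ", train)
```

Under these assumptions, with four MRs and a 30-minute pipeline, the last MR finishes at minute 120 when pipelines run one at a time, versus minute 30 in an all-green merge train. Any failure in the train narrows that gap, since the pipelines behind the failed item must be rebuilt.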
Generate All Outcomes
Another variation of the Hybrid approach above would be for each MR joining the train to kick off pipelines for every possible combination of outcomes among the MRs ahead of it. If there are already 3 MRs in the train, the 4th MR will start the following pipelines:
- 1+2+3+4 (in case all succeed)
- 1+3+4 (in case 2 fails)
- 1+4 (in case 2 and 3 fail)
- 4 (in case 1, 2, and 3 fail)
- ... and so on.
You'd probably need a depth limit at some point, but this would guarantee a timely result for whatever scenario occurs. For scenarios where compute is less expensive than time spent with a broken target branch, this is the optimal approach.
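As a sketch of how quickly this grows, the following Python enumerates the pipelines a newly added MR would start. The function name `outcome_pipelines` and the `max_failures` parameter are hypothetical; `max_failures` is one possible reading of the depth limit mentioned above (plan for at most that many failures ahead), not a settled design.

```python
from itertools import combinations


def outcome_pipelines(ahead, new_mr, max_failures=None):
    """List every merge result the new MR might need to be built on.

    `ahead` holds the MRs already in the train, in order. Each returned tuple
    is the subset of `ahead` assumed to succeed, followed by `new_mr`.
    `max_failures` (assumed depth limit) caps how many failures are planned for.
    """
    limit = len(ahead) if max_failures is None else min(max_failures, len(ahead))
    scenarios = []
    for n_failed in range(limit + 1):
        for failed in combinations(ahead, n_failed):
            survivors = tuple(mr for mr in ahead if mr not in failed)
            scenarios.append(survivors + (new_mr,))
    return scenarios


scenarios = outcome_pipelines([1, 2, 3], 4)
for scenario in scenarios:
    print("+".join(str(mr) for mr in scenario))
print(len(scenarios), "pipelines for the 4th MR")

limited = outcome_pipelines([1, 2, 3], 4, max_failures=1)
print(len(limited), "pipelines when planning for at most one failure")
```

Without a limit, an MR with n items ahead of it starts 2^n pipelines (8 for the 4th MR), which is why some cap is needed in practice.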
What does success look like, and how can we measure that?