Parallel execution strategy for Merge Trains

Problem to solve

https://gitlab.com/gitlab-org/gitlab-ee/issues/9186 introduces the concept of merge trains, but for the MVC we are only running them sequentially. To really reap the benefits of merge trains, we can optimistically build refs and run the pipelines in parallel, resulting in very fast merge train execution for scenarios where most pipelines are likely to succeed.

Intended users

All Development Teams

Proposal

Each MR that joins a merge train joins as the last item in the train, just as it does today. However, instead of queuing and waiting, each item takes the expected result of the previous (still pending) merge ref, adds its own changes, and starts its pipeline immediately in parallel, under the assumption that everything ahead of it is going to pass. In this way, if all the pipelines in the train succeed, no pipeline time is wasted on either queuing or retrying. If the button is subsequently pressed in a different MR, then instead of creating a new pipeline against the target branch, it creates a new pipeline against the merge result of the previous MR plus the target branch. Pipelines invalidated by failures are immediately canceled and requeued.

With this iteration, we also remove the problem of merge trains being one-at-a-time, so we can remove the option to enable merge trains and make this strategy the default. The user should be able to select "one at a time" as the strategy if they want, but this should not be the default. We keep the strategy selectable for use cases such as:

- https://gitlab.com/gitlab-org/gitlab-ce/issues/20481 (as a kind of branch-based approach to limiting pipeline concurrency)
- general sensitivity to cost of running parallel pipelines
- related to above, high likelihood of pipeline failures which would result in wasted resources
- also related to above, very low traffic repo where little is gained through optimistic parallelization

Furthermore, in the future we may add additional strategies (https://gitlab.com/gitlab-org/gitlab-ee/issues/11222#other-potential-strategies), and this would be the natural place to select those.

Working Example

For example, if four MRs are queued together, this is what the refs their pipelines build (in parallel) would look like:

MR1
MR1+MR2
MR1+MR2+MR3
MR1+MR2+MR3+MR4
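The composition above can be sketched in a few lines. This is only an illustration of the cumulative structure, not GitLab's implementation; the function name `cumulative_refs` and the use of plain strings to stand for merge refs are assumptions made for the example.

```python
def cumulative_refs(train):
    """Return, for each MR in the train, the content its pipeline would build.

    `train` is an ordered list of MR labels (hypothetical stand-ins for
    real merge refs). Each entry contains every MR ahead of it plus itself.
    """
    return ["+".join(train[: i + 1]) for i in range(len(train))]


print(cumulative_refs(["MR1", "MR2", "MR3", "MR4"]))
# ['MR1', 'MR1+MR2', 'MR1+MR2+MR3', 'MR1+MR2+MR3+MR4']
```

Because every ref includes all of its predecessors, a pipeline result is only meaningful if everything ahead of it actually merges.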

It's important to note that, given this composition, they must never be allowed to merge out of order, even if a later one somehow finishes earlier. Also, if any item fails, the train will need to be recalculated and restarted with the first non-failing item as the first in the train. For example, if MR2 (pipeline 2) fails (or is removed or canceled), all running pipelines for the merge train will be canceled as invalid, and a new train will be built containing:

MR3
MR3+MR4

In this scenario, MR1 will have already merged, so it is no longer in play. MR2 is known to be broken, so it is not added back to the merge train. MR2 could potentially resolve its issues and queue back up as the third item, MR3+MR4+MR2.
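A minimal sketch of that recalculation, assuming the train is an ordered list of MR labels and that merged and failed MRs are tracked separately. The names `rebuild_train` and `cumulative_refs` are hypothetical and do not come from the GitLab codebase.

```python
def cumulative_refs(train):
    """Content built by each MR's pipeline: all MRs ahead of it plus itself."""
    return ["+".join(train[: i + 1]) for i in range(len(train))]


def rebuild_train(train, merged, failed):
    """Drop already-merged MRs and the failed MR, preserving the original order."""
    return [mr for mr in train if mr not in merged and mr != failed]


train = ["MR1", "MR2", "MR3", "MR4"]

# MR1 has merged and MR2 failed: the remaining MRs form a new train.
rebuilt = rebuild_train(train, merged={"MR1"}, failed="MR2")
print(cumulative_refs(rebuilt))        # ['MR3', 'MR3+MR4']

# MR2 is fixed and rejoins at the end of the train.
requeued = rebuilt + ["MR2"]
print(cumulative_refs(requeued)[-1])   # MR3+MR4+MR2
```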

If the target branch is updated by someone directly, bypassing the merge train, all pipelines are immediately canceled, recalculated, and restarted using adjusted refs. People should not be committing directly to the target branch if they are using merge trains, since it invalidates the whole train. We won't block this, in case of emergencies, but it is definitely an exception and not a normal use case.

Permissions and Security

We don't limit the maximum length of a merge train, although we will limit the maximum parallelization. For the MVC we set the parallelization factor to 4. In future iterations this can potentially be tuned or made configurable, though our desire is to limit configuration options as much as possible.
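One possible reading of this limit, sketched under the assumption that the first four MRs in the train get running pipelines while the rest wait their turn. The constant and function names are hypothetical, and the text does not specify how waiting MRs are promoted.

```python
MAX_PARALLEL_PIPELINES = 4  # MVC value from the proposal; assumed not yet configurable


def partition_train(train, limit=MAX_PARALLEL_PIPELINES):
    """Split an ordered train into MRs with running pipelines and MRs still waiting.

    The train itself has no length limit; only the number of pipelines
    running at once is capped.
    """
    return train[:limit], train[limit:]


running, waiting = partition_train([f"MR{n}" for n in range(1, 7)])
print(running)  # ['MR1', 'MR2', 'MR3', 'MR4']
print(waiting)  # ['MR5', 'MR6']
```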

Documentation

We will need to update the merge trains documentation to describe how this new strategy works. We are also changing the default strategy from one-at-a-time, so users of that strategy will need to know how to change back.

Testing

Testing should focus particularly on making extremely sure that out-of-order merges do not happen.
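The invariant under test could be expressed roughly as follows. This is a sketch with hypothetical names (`mergeable_now`, the `status` mapping, and the status strings), not the actual merge logic: an MR may merge only if every MR ahead of it has also passed.

```python
def mergeable_now(train, status):
    """Return the leading run of MRs whose pipelines have passed.

    `train` is the ordered list of MRs; `status` maps an MR to a pipeline
    state such as "passed", "running", or "failed". Iteration stops at the
    first MR that has not passed, so a later MR can never merge first.
    """
    ready = []
    for mr in train:
        if status.get(mr) != "passed":
            break
        ready.append(mr)
    return ready


train = ["MR1", "MR2", "MR3"]

# MR2 and MR3 finished first, but MR1 is still running: nothing may merge.
print(mergeable_now(train, {"MR1": "running", "MR2": "passed", "MR3": "passed"}))  # []

# MR1 and MR2 passed, MR3 is still running: MR1 and MR2 may merge, in that order.
print(mergeable_now(train, {"MR1": "passed", "MR2": "passed", "MR3": "running"}))  # ['MR1', 'MR2']
```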

What does success look like, and how can we measure that?

We should measure the number of pipelines that are part of a merge train.

What is the type of buyer?

This feature is used by all developers.

Links / references

Slack: #f_merge_trains

Product Design

The copy for merge train pipelines should be modified as follows:


Example UI

Copy
Pipeline #000 failed for merge train into [target-branch]
Pipeline #000 passed for merge train into [target-branch]
Pipeline #000 passed with warnings for merge train into [target-branch]
Pipeline #000 skipped for merge train with [target-branch]

Other Potential Strategies

Generate All Outcomes

Another variation of Hybrid above would be to kick off pipelines for every possible combination of outcome scenarios in the merge train, for each MR. If there are already 3 MRs in the train, the 4th MR will start the following pipelines:

1+2+3+4 (in case all succeed)
1+3+4 (in case 2 fails)
1+4 (in case 2 and 3 fail)
4 (in case 1, 2, and 3 fail)
... and so on.

You'd probably need some depth limit at some point, but this would guarantee a timely result for whatever scenario occurs. For scenarios where compute is less expensive than time spent with a broken target branch, this is the optimal approach. It sounds unusual, but we've had some customers for whom time is worth much, much more than money (and who have significant resources available), and this is their preferred strategy.
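To illustrate how quickly the number of pipelines grows, here is a sketch that enumerates every outcome for a newly added MR. It assumes each MR ahead of it either survives or fails independently; `all_outcome_refs` is a hypothetical name, and no depth limit is applied.

```python
from itertools import combinations


def all_outcome_refs(predecessors, new_mr):
    """List one ref per possible set of surviving predecessors, plus the new MR.

    Starts with the all-succeed case and ends with the case where every
    predecessor fails, giving 2 ** len(predecessors) refs in total.
    """
    refs = []
    for size in range(len(predecessors), -1, -1):
        for survivors in combinations(predecessors, size):
            refs.append("+".join([*survivors, new_mr]))
    return refs


refs = all_outcome_refs(["1", "2", "3"], "4")
print(len(refs))  # 8
print(refs)
# ['1+2+3+4', '1+2+4', '1+3+4', '2+3+4', '1+4', '2+4', '3+4', '4']
```

The count doubles with each MR ahead in the train, which is why a depth limit would probably be needed.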

Uber-style dependency analysis

Implement analysis of relationships in code to determine pipeline order. This is an advanced technique and would require significant understanding of the code checked in to the repo. Details on Uber's implementation are available at https://eng.uber.com/research/keeping-master-green-at-scale/

Edited Jul 05, 2019 by Rayana Verissimo