It's possible we're actually optimizing for the wrong thing here, and probable that we're at least being too conservative. Developer time is much more expensive than runner minutes, and we should prioritize fast results in busy projects over the small incremental cost of runner minutes.
Intended users
Users of merge trains
Further details
Proposal
We should remove the limitation on concurrency, or at least set it to a "protect the system" value rather than something as small as 4, which busy projects easily hit under real-world conditions.
Some factors to consider that might determine the final value:
How large have we seen the queue grow in production? That could indicate a realistic limit that protects the platform without normal users bumping into it.
How often do merges fail in a merge train? If this is low, and it probably is, it makes a case for as high a setting as possible.
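To put a rough number on that intuition, here is a toy Python model (illustrative assumptions only, not GitLab's actual scheduling logic): with an independent per-MR failure probability, it estimates how many speculative pipeline runs a fully parallel train throws away when the first failing MR is ejected and everything behind it restarts.

```python
def expected_wasted_runs(n_mrs: int, p_fail: float) -> float:
    """Expected number of speculative pipeline runs thrown away when
    the *first* failing MR is ejected from a train of n_mrs running
    fully in parallel. Toy model: independent failures, only the
    first failure considered."""
    waste = 0.0
    for i in range(1, n_mrs + 1):
        # Probability the first failure occurs at position i.
        p_first_fail_here = (1 - p_fail) ** (i - 1) * p_fail
        # A failure at position i invalidates the n_mrs - i runs behind it.
        waste += p_first_fail_here * (n_mrs - i)
    return waste

# With a 5% per-MR failure rate, a fully parallel train of 10 wastes
# only about two pipeline runs on average.
print(expected_wasted_runs(10, 0.05))
```

Under these assumptions the expected waste stays small relative to the throughput gained, which supports running concurrency as high as the platform can safely absorb.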
@ogolowinski can you please review and consider prioritizing? It's probably very small, and would be a way to quickly help people get a lot more out of merge trains.
@ogolowinski At this moment, removing this concurrency limit would not be an option, as we're moving towards introducing more application limits to make GitLab robust/scalable/resilient. Unlimited concurrency could have a severe impact when there are many MRs on the train: with e.g. 1000 MRs, a single pipeline failure could trigger 999 pipelines at the same time. Even if it's an edge case, we'd like to avoid letting it happen.
Users can change the concurrency from 1-10 (10 is the maximum).
The upper limit "10" is arguable; maybe it should be 100 or more. Regardless of the exact value, we'd first like to build functionality that allows users to set the value in project-level config (likely under the option "Merge pipelines will try to validate the post-merge result prior to merging", or in a different section).
I do not think we'll ever allow unlimited concurrency on gitlab.com; however, we can allow it for on-premises instances by introducing additional admin-level configuration to set the upper limit. So gitlab.com sets it to 10, but on-prem instances can set whatever value they like, e.g. 1000.
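A minimal sketch of how the two-level limit described above could compose, with an admin-level ceiling per instance and a user-facing project setting. All names here are hypothetical illustrations, not GitLab's actual code or configuration keys.

```python
from dataclasses import dataclass

@dataclass
class InstanceConfig:
    # Admin-level ceiling: e.g. 10 on gitlab.com, 1000 on an on-prem instance.
    max_train_concurrency: int

@dataclass
class ProjectConfig:
    # What the project owner asked for in project-level settings.
    train_concurrency: int

def effective_concurrency(instance: InstanceConfig, project: ProjectConfig) -> int:
    """A project can never exceed the instance-level ceiling,
    and the train always makes at least some forward progress."""
    return max(1, min(project.train_concurrency, instance.max_train_concurrency))

gitlab_com = InstanceConfig(max_train_concurrency=10)
print(effective_concurrency(gitlab_com, ProjectConfig(train_concurrency=50)))  # clamped to 10
```

The design point is that the user-facing knob stays simple while SRE retains a hard ceiling they can lower during an incident without touching any project's settings.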
@dosuken123 makes sense. That being said, we should monitor the configured concurrency values; if the majority of users are setting it to 10, we should probably increase both the default and the max.
We should also consider making this higher priority - when we turned on www-gitlab-com merge trains, everyone felt they were too slow because everything was constrained to 4 at a time. It was made worse by not being able to tell where in the queue you were via the UX, but still this could be a blocker for realistic use. I hope we can turn back on merge trains before 12.5.
@dosuken123 @ogolowinski if most pipelines pass, is there a reason not to optimistically run most of them? Aside from setting a reasonable limit to prevent runaways, what is the goal of limiting the maximum usefulness of the feature?
@ogolowinski what I meant was: aside from setting reasonable limits to prevent runaways, why limit the maximum effectiveness of the feature? Or is 10 really the maximum that the system can support from a stability standpoint? That would be surprising.
Can we start by measuring how long MRs actually wait to be merged due to this limit, so we can confirm whether a new limit effectively resolves the problem?
@jlenny I agree with @dosuken123. I think we should set an arbitrary number and validate whether it is the optimal one: start with 10 and keep doubling it based on metrics and performance until we find the magic number.
@ogolowinski it sounds like @dosuken123 is saying we should try to do something data-based, which I agree with. Is running 10 concurrent pipelines in a project a critical system failure state? I don't think so - I routinely see www-gitlab-com running 50 or 60 concurrent pipelines. If we turned on merge trains in that project, all we would achieve is arbitrarily slowing them down to 1/6 of the rate that pipelines run today, for no clear reason as best I can tell, since running 60 concurrent pipelines today apparently doesn't cause an incident. Turning merge trains on would be a demonstrable downgrade from having them off under normal operating conditions.
In any case, picking a number and doubling it until there's a problem doesn't feel right.
@ogolowinski I think we need at least some control over the parallel factor. The state of our production servers varies from time to time, and even if a parallel factor of 100 works for a month, it can suddenly turn into a production incident when too many pipelines are initiated accidentally. So it's hard to determine a factor that works all the time, and hardcoding the value in the application is quite dangerous, because SRE cannot react to or mitigate an incident immediately.
On gitlab.com, we evaluate the new factor for a week. If no stress on our production fleet is observed, we keep the value as the permanent factor.
The good thing is that we can reevaluate the factor in the future (e.g. increase it to 20) without additional development effort. We can simply repeat steps 2 and 3.
Thanks @dosuken123, this seems like a good MVC. If I understand correctly, making this an instance-level configuration would make it possible to change without a re-deploy.
Please go ahead and get started on these changes since it's affecting dogfooding.
Can't we just set it to a safe system limit? Is there some reason we think that running 4 pipelines at a time is living on the edge and a potentially dangerous situation? Many projects run more than this all the time already.
@jlenny There is no obvious safe system limit that works all the time. Basically, every CI feature should have a configurable limit for mitigating any kind of outage. For example, if we see problematic behavior in another part of the CI feature set and it introduces an outage, merge trains could escalate the situation and make it worse by creating a lot of pipelines in a short interval. In such a case, SRE should temporarily lower the limit until we've fixed the root cause, then gradually increase it again.
If we don't want to expose the configuration in the UI, we should at least allow SRE to control the value via the Rails console, a rake task, or a feature flag.
This falls under the concept of configurable application limits. They are not for general consumption (i.e. they are for administrators), but rather give us the ability to change aspects of the system dynamically: #34634 (closed). Ideally the limit should represent the current capability of the system to run a given workload.
We are very close to having a clearly defined (and likely implemented in this release) way to define such a limit as part of the above issue, since the need for a central, configurable mechanism spans well beyond CI :)
Whatever we set it to, it should not be slower than not using merge trains under basic operation. During peak today we are seeing what used to be 10-minute pipelines take 40+ minutes waiting to squeeze through the 4-pipeline limit (at least, it was at 40 minutes before Sid gave up and just merged directly to master).
If we can't release the feature now in a state that doesn't cause usability issues, we should turn it back off until we can.
For now, we allow setting high (20) / medium (10) / low (4) concurrency via feature flags. The tiers are hardcoded, so we cannot change the limit to arbitrary values quickly, but it's the fastest way to mitigate the www-gitlab-com case.
Sure, that's great. Being able to temporarily configure it internally as we roll it out is fine, adding a new configuration option for all our users to have to worry about is not.
The MR !19131 (merged) has been merged. We're waiting for it to be deployed on gitlab.com. I'll increase the parallel factor on www-gitlab-com once it's deployed.
I know this ticket is old, but I found it linked on the GitLab blog.
Why not throttle how often pipelines start, and have each started pipeline contain all the MRs that it can?
For example, let's say you have an empty train and then 10 MRs get added in quick succession. You could have 10 pipelines, or you could have just one of those pipelines (the one that had all 10 MRs). If that one succeeds, then the previous 9 weren't necessary. If it fails, then backtrack and issue the 9 pipelines. This is very useful if you have lots of MRs, few runners, and/or a long pipeline process.
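The batching idea above can be sketched as follows. This is a toy, sequential variant of the backtracking strategy described (one pipeline for the full batch, then successively shorter prefixes on failure), not GitLab's actual merge-train algorithm, and `pipeline_passes` is a hypothetical stand-in for a real CI run.

```python
def run_train_batched(mrs, pipeline_passes):
    """Run one pipeline on the whole batch first; on failure, fall back
    to per-prefix pipelines to find the longest mergeable prefix.
    Returns (merged_mrs, pipelines_run)."""
    pipelines = 0
    if mrs:
        pipelines += 1
        if pipeline_passes(mrs):  # optimistic: one pipeline covers all MRs
            return mrs, pipelines
    # Backtrack: test successively shorter prefixes until one passes.
    for end in range(len(mrs) - 1, 0, -1):
        pipelines += 1
        if pipeline_passes(mrs[:end]):
            return mrs[:end], pipelines
    return [], pipelines

# 10 MRs, all compatible: a single pipeline merges everything.
merged, runs = run_train_batched(list(range(10)), lambda prefix: True)
print(len(merged), runs)  # 10 1
```

In the all-pass case a train of 10 costs one pipeline instead of 10; in the worst case the sequential fallback costs n pipelines in total, so a real implementation would likely issue the backtracking prefixes in parallel, as the comment suggests.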
I see that this issue references being able to configure the maximum parallel trains but I don't see this option anywhere? Is it hidden somewhere? I want to be able to limit the maximum concurrency, how would I do that?