Note that on day one of the launch, we are temporarily limiting the
maximum number of jobs that a single job can need in the needs: array. Track
our infrastructure issue
for details on the current limit.
This limits the following types of situations, where the number of jobs that a single job wants to depend on is over the limit:
```yaml
my_job:
  needs: [dependency1, dependency2, ...]
```
As such, we plan to roll it out as follows:
On the 22nd the feature will be shipped with a small limit (5). If someone goes over the limit, they will receive a pipeline creation error (rspec: one job can only need 5 others, but you have listed 6. See needs keyword documentation for more details)
@ayufan will monitor over a few days to determine the performance characteristics.
If the feature is performant, we will set the feature to a large limit (50). It will remain there for the next monitoring period.
If we see serious performance issues, we will disable the feature entirely. Users' pipelines that use the DAG will still run, but will follow stage sequencing.
The feature flag for toggling between 50 and 5 is ci_dag_limit_needs; it will limit to 5 when enabled. The feature flag for turning the feature on or off completely is ci_dag_support. We will need infrastructure support for the monitoring and toggling of necessary feature flags per the above plan.
Once the limit has been removed, we should update the needs: section in the yaml docs to remove the reference to this issue.
@ayufan can you add the feature flag names and any other technical details the infrastructure team will need, such as what to monitor? Also, the specific limits you're going to set at each point.
Is the limit going to go away after gitlab-org/gitlab-ce!31768? And is this just a limit for GitLab.com or is this going to become a limit in any of the standalone products? We have an Ultimate license.
Recall our use case here for building a package DAG for Spack CI. The limit of 5 definitely breaks our use case. The limit of 50 probably breaks it for our largest packages.
If I look through our largest packages, I see dependency counts like this:
So, the largest package (CERN's ROOT analysis tool) can have 58 direct dependencies (packages are templated so it may not always have this many). Even a limit of 50 would rule out building this.
If I consider transitive dependencies, things get more complicated. We currently pass dependency binary packages as artifacts through GitLab. So, I think our jobs could conceivably need ~500 other jobs, if that's required to get the artifacts. I haven't dug quite this deep into the new feature yet.
Does GitLab pass along artifacts from transitively needed jobs? If not, is it still a scalability/performance issue if I need the direct dependencies and add dependencies rules for transitive dependencies?
OK, and does a job get artifacts for transitive needs, or just direct needs?
If not transitive, we would need this to support 500 needs for our use case (maybe more eventually). That, or we need to pass artifacts between jobs outside gitlab and only use needs for control flow. We have not been able to do this using S3 in the past as it’s not consistent, but we could maybe use a shared filesystem.
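For the control-flow-only idea, here's a rough sketch of what that might look like, assuming the needs:artifacts flag behaves as documented (job names are made up):

```yaml
# Hypothetical job names. `artifacts: false` keeps the ordering edge
# but skips the artifact download, so binaries would come from an
# external store (S3 or a shared filesystem) instead.
build_pkg:
  stage: build
  script: ./build.sh

use_pkg:
  stage: build
  script: ./fetch_from_shared_store.sh && ./build.sh
  needs:
    - job: build_pkg
      artifacts: false
```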
Adding another question here for @ayufan - maybe this could work with @ahanselka? If this is still behind a feature flag which limits to 5 at the moment, can we try changing the flag on GitLab.com before we look at !22706 (closed), which is now related?
The introduction of cross-project includes has further confused the limits on needs:; cross-project artifacts referenced through needs: are currently limited to 5. I've created #207329 to find a resolution or better document why this is different.
We also need hundreds of needs in our dependency networks. We are managing a software distribution and have packages with 500+ transitive dependencies. A limit of 1000 would likely be sufficient, but ideally we'd be able to increase it as administrators instead of just being able to choose between 5 and 50, both of which are way too small. cc @mterhar
Could we enable the ci_dag_limit_needs feature flag that bumps the limit to 50 now and monitor? If everything is fine we could mark this as default_enabled: true and ship it with %13.2. Once that is done, we could try to increase the limit even further, to something like 100.
@ayufan the requirement is whatever the highest value is that is safe for the system. :) We have users with use cases up over 1000, so we should consider supporting a higher limit for self-managed if some users want it too.
In the end it is @thaoyeager's call when to schedule this and what to set it to, but I do think it is "whatever the highest value engineering says is OK."
I'm now thinking it might be fine to increase the limit on needs:, but we should ensure that needs: are bulk-inserted. I'm quite certain they are not today. So I would say that 50 is safe for today, but with the expectation that we add bulk insert, since a significantly larger number of needs: would otherwise result in an explosion of SQL queries.
Does it make sense for the default value to be set based on gitlab.com?
I think it would be fine to make gitlab.com's limit 50 if that is what is needed, but we already hack our own GitLab instance to run with a limit of 1000, and it works fine so far for pipelines with hundreds of needs.
Given that most on-prem instances do not have hundreds of thousands (millions?) of users like gitlab.com, could the default for on-prem instances be set to 1000? It seems like gitlab.com is really the special case here.
@tgamblin I'm up for slowly increasing the limit and measuring the impact to ensure that there's parity between default and gitlab.com. The whole reason for doing atomic processing was to have that room :) I'm just being careful as I'm worried about the amplification effect, but we can solve that while increasing the limit.
@thaoyeager we are also still waiting on the limit increase for the www-gitlab-com repo. We are currently having to fall back to using dependencies because we have more than 10 dependencies, which results in slower pipelines, because it waits on the entire prior stage to finish.
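For anyone following along, a rough sketch of the fallback (job names are made up): with dependencies: the job still waits for every job in earlier stages, whereas needs: would let it start as soon as the listed jobs finish.

```yaml
# Fallback while over the needs: limit -- artifacts come from the
# listed jobs, but the job still waits for the whole prior stage.
deploy_review:
  stage: deploy
  script: ./deploy.sh
  dependencies: [build_site, build_assets]

# What we'd prefer (currently blocked by the limit):
#   needs: [build_site, build_assets, ...]
```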
Once that is done, we could try to increase the limit even further, to something like 100.
100 is insufficient for managing dependencies in large software stacks, which I'm pretty sure is a requirement both for us and for people with monorepos.
What is the requirement for www-gitlab-com?
Could we enable the ci_dag_limit_needs feature flag that bumps the limit to 50 now and monitor?
It is concerning that the limit would be different on gitlab.com vs. private instances, and that the limit can be different across private instances. For our use case, we're really trying to design a turn-key build system that any site with GitLab can use. At the moment, we run into issues with portability and we have to ask each site's administrators to increase their needs: limit. The fact that it currently requires patching the code makes this even more problematic, as we have to explain to site admins why this isn't just a regular configuration parameter, and they often have security policies that prevent us from using patches. That's a lot of friction just to use the DAG feature.
In my ideal world I could just tell people to run our CI setup in any GitLab instance, but for now we have to do a bunch of configuration.
It's odd to me that the overhead of a pipeline would ever scale with the number of dynamic needs vs. the number and intensity of jobs in the pipeline. It should be possible to design an efficient pipeline that does not have overheads that scale with needs, and so that there would be no need for a limit. I hope that GitLab gets there soon, as this limit makes the needs: feature extremely hard to use in practice.
There's overhead that scales linearly with the number of needs; it comes from having to recalculate a number of distinct statuses. This is not a problem for stages, as you cannot easily have 1000 different stage specifications. However, each build can have its own needs specified. Since individual builds might not share their needs, you effectively have a complexity of O(n*m), where n is the number of builds and m is the number of needs each build has. Now, if you have 1000 builds, each with 1000 needs, the worst case can require traversing 1M nodes to ensure that all statuses are accurate. There's also the space required to store all of them.
@tgamblin @ayufan if we supported defining groups and needing the group, would this get you further? You could maybe implement this by allowing for needing stages:
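Purely hypothetical syntax to illustrate the idea (not something GitLab CI supports today):

```yaml
# Hypothetical: one edge to a whole stage instead of one edge per job.
test_all:
  stage: test
  script: ./run_tests.sh
  needs:
    - stage: build   # not a real keyword today, shown only for illustration
```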
Something like this might allow for reducing computation as well as the count of one to one relationships (presumably, needing one stage would count as 1 towards the DAG limit).
@jyavorska: in our case it probably would not. We declare needs: on transitive dependencies in the DAG. So, given jobs:
```
    A
   / \
  B   C
   \ / \
    D   E
     \ /
      F
```
A needs artifacts from B, C, D, E, F.
B needs artifacts from D, F.
C needs artifacts from D, E, F.
D needs artifacts from F.
E needs artifacts from F.
F doesn't need anything.
The sets of needs are not distinct, so I do not think a stage would work.
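Spelled out as CI config (a sketch; script: and artifacts: settings omitted), the explicit form looks like:

```yaml
# Each job would build one package and publish its result as an artifact.
A: { needs: [B, C, D, E, F] }
B: { needs: [D, F] }
C: { needs: [D, E, F] }
D: { needs: [F] }
E: { needs: [F] }
F: { needs: [] }
```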
We discussed with @mterhar the possibility of having a way to declare that a need is transitive. In that scenario the graph above would just say:
A needs transitive B, C.
B needs transitive D.
C needs transitive D, E.
D needs transitive F.
E needs transitive F.
F doesn't need anything.
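There's no such keyword today; as a purely hypothetical sketch of the idea, job C might be written as:

```yaml
# Hypothetical "transitive" flag -- not existing GitLab CI syntax.
# Only direct dependencies are declared; F would be pulled in via D and E.
C:
  needs:
    - { job: D, transitive: true }
    - { job: E, transitive: true }
```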
The issue here is that we may want to omit some of the needs, and this would get us more artifacts than we care about. e.g., if E and F are build dependencies of C and D, and C and D are already built, C and D wouldn't need E and F. Neither would any of their dependents. So for that case you'd have:
A needs artifacts from B, C, D.
B needs artifacts from D.
C needs artifacts from D.
D, E, and F have no needs and can just run in parallel.
Whether you get this DAG or the first one depends on our DAG generation phase and on what's already built (the state of our binary package cache).
We decided that specifying the needs explicitly and just generating what we needed was simpler than trying to address these types of relationships in any sort of core GitLab support. So, we just need lots of needs :).
IMO a cap on the number of parallel jobs within a single pipeline (currently capped at 50) makes sense for Shared Runners, but not for ones the user hosts themselves (e.g. BYOD Runners). Any chance we can consider limiting Shared Runners to 50 but either increasing or removing the cap for self-managed Runners? CC/ @dawsmith @rhyry1
Update on my previous comment. @rhyry1 and I have a prospect for which the limit on the number of parallel jobs is a blocker. They would love a configurable limit, and may be willing to pay for additional capacity. They are making a decision within 2 weeks. Opportunity is here: https://gitlab.my.salesforce.com/0064M00000XYkkj. CC/ @dsakamoto
@ayufan we have #197886 (closed) (and even an MR !22706 (closed) that brings it to 25.) The issue has no limit specified since at the time it was unclear what to pick, but we can choose one and schedule it, or just take the community contribution.
As discussed, 50 is still quite low, but could the parameter at least be made an integer in the near term?
The fact that it is not customizable (it’s either 5 or 50 right now) means we have to patch the code to make our self-hosted instance have a limit of 1000. Admins at DOE sites really hate this. If it were a configurable option it would be much easier for us to have sites override the default.
@jeanduplessis I think increasing the limit is one thing, but we need to at least fix this first: !36815 (merged). I think it should be quite OK then to enable a higher limit for DAG that we can control via feature flag.
We've essentially got tools to build a small distro in a GitLab pipeline, with all dependencies represented. But we can't recommend that users try this on gitlab.com, because the limits are too small. This makes it hard to get real traction for what we've built, and we keep having to tell new users of GitLab to change their local needs limits, as well.
A limit of 1,000 would be nice and would probably cover most of the things we'd want to do before splitting into multiple pipelines. A limit of 10,000 would probably solve this once and for all.
I'm speaking specifically from the perspective of performance: the limit of 1_000 needs, with 1_000 jobs allowed, results in a pessimistic case of 1_000_000 edges in the DAG. I don't think we are optimised for that yet.
Can you describe how many edges you expect to be supported?
@ayufan: I dug into this a bit to see what would be required. I plotted a histogram of the dependency counts for all 5600 or so of our packages:
The average number of dependencies per package is a bit over 4. The most direct dependencies any package has is 88 (this has grown a bit over time but not fast). So at first glance, a needs: limit of 200 would probably be fine. We aren't likely to (ever?) have a package with more direct dependencies than that.
With needs you can only download artifacts from the jobs listed in the needs: configuration.
Because we are building packages, we don't just need artifacts from direct dependencies -- we need artifacts from transitive dependencies (i.e., if -> is a dependency and A -> B -> C, A must have needs: for B and C, not just for B). So we have to put a needs: for everything preceding a given job. The distribution of total transitive dependencies for the same packages looks like this:
Quite a few (579, more than 10% of packages) have > 50 transitive dependencies. The biggest package has 321. These big packages tend to be useful things (ML packages like tensorflow and pytorch, R packages for analysis, math libraries, etc.), so it's fairly common for them to end up in someone's software stack. This is where I'm getting my number of 1,000 -- 500 isn't too far off, and 1,000 needs: per job seems like a good upper bound.
I looked at a few stacks we've built and found that for a pipeline with 345 packages, there were around 6,600 transitive edges in the pipeline. This is where I am getting the 10,000 upper bound on total edges in a pipeline. There are only around 1,100 direct edges in this pipeline.
Does this help? I can try to characterize the package graphs more if you tell me what you want to know.
One question I have is whether it would be possible to allow for dependencies: on transitive needs:. If we had that, we would not require nearly as many needs: in the pipeline. Here is an example:
A:needs:["B"]# don't need to say needs: ["C"] b/c B already needs C.dependencies:["B","C"]# get artifacts from B and CB:needs:["C"]dependencies:["C"]C:...
This would allow for fine-grained control over which artifacts are downloaded, and needs: would only be for control flow.
Would that be easier to implement from a scalability perspective? Or does it make a difference?
I think this is exactly what I need. I like the concept of limiting the number of edges and raising the needs limit, or requiring that one of the two is fulfilled.
One question I have is whether it would be possible to allow for dependencies: on transitive needs:. If we had that, we would not require nearly as many needs: in the pipeline. Here is an example:
This is a truly remarkable suggestion. I think we could allow that. In that model, how many would you need?
@ayufan: we'd still need lots of dependencies: (as we'd need to download artifacts from the same transitive dependencies), but I think we could make do with 200 needs: per job, and maybe 2,000 total needs: for an entire pipeline.
I don't think we'll exceed 200 needs: per job any time soon. I think 3,000 would be a good total limit on needs for a pipeline with this model. Currently we have pipelines with around 1200 edges, so doubling that and rounding seems reasonable. Obviously more is better, but at least I think this model scales linearly with jobs, not quadratically as transitive needs tend to.
Another question on dependencies, artifacts, and packages: the way we currently get around some of these limitations is to put intermediate artifacts in S3 or in the local filesystem. Is it reasonable to use GitLab package registries for this? Would that be faster than using artifacts? What we are implementing is really a build farm for a package manager, but I am not sure that the registry support in GitLab handles these types of build caches well. It seems designed for smaller sets of released packages, not lots of builds.
If we can define a maximum number of DAG edges for the whole pipeline I think this is acceptable, and we can definitely raise the individual limit so that we enforce one of the two:
edges per node OR total edges per pipeline
As for dependencies: on transitive needs: this is valid; I think we can optimize it to not count against the limit.
In order to raise the limit on an individual node (even with a limit on total edges) we need to:
performance test AtomicPipelineProcessing
measure memory usage for storing more ci_build_needs
measure impact on mini-pipeline graph (of pipeline list)
performance test the DAG graph view (a new tab added to the pipeline view)
@marknuzzo @samdbeckham @dhershkovitch For your review and relabelling for Pipeline Authoring. I think we can probably move this to the regular gitlab project.
Our docs still reference this issue for the reasoning of the limit being set to 50. Are we still waiting on performance tweaks before raising this, or can we move forward with an increase?
@furkanayhan @lauraX - I don't know if you have any additional context around performance from the discussions about increasing this limit over the past few years.
Are there other things that should be factored into a change like this? Do we have data that provides a guideline as to what a recommended increase could look like?
@furkanayhan @lauraX The value 50 (or whatever it is) should come from the API. This means that whatever maximum we want to show should be given by the API, even when no limit has been explicitly set, to facilitate changing the default in the API only.
Also, if the limit can be increased in the admin panel, do we have our own internal limit? Meaning, if an admin sets this to 1000, do we want to prevent them from setting it that high? If so, we might need to get that value from the API as well so that we can show that limit in the description.
The alternative would be for the API to just return an error if the limit is too high, but that's a less ideal UX. WDYT?
@furkanayhan Just to make sure, do you mean "no, there is no limit", as in someone could set that number to 1 000 000 000? If so, then why have a limit at all?
@f_caplette out of curiosity - is there a "limit" to how many needs the DAG graph can support before it becomes unwieldy? I remember some chats a while ago discussing the possibility of a different UI for B-I-G graphs
@lauraX We'd have to test the number again given how much has changed, but having 1000 needs was super slow to show, if I remember correctly. Now the user needs to explicitly enable the lines, so at least there is control there, but we would still need to validate how many lines we can show, and we might have to disable the toggle if there are too many.
I don't think the users are really looking at dependency lines for large pipelines -- or at least we're not.
It would actually be more useful for large pipelines if the UI just showed statistics about how many jobs are waiting, running, finished, and errored (and maybe a percent done). The UI could also provide a list of the top n jobs and a search widget so that users could search for subsets.
This would actually be better than the current view, as right now I have to scroll around to find the jobs I'm interested in.
+1 I don't care much for the dependencies UI - I'd rather have the ability to add more than 50 needs. Right now we're bypassing the limitation by using dependencies - similar to what the gitlab-foss pipeline does - but that either means some jobs start later (because they wait for all previous stages to finish) or we need to refactor our pipeline, which already has more than 150 jobs, which is also not ideal.
@tgamblin @fmagalhaes_singlestore I am sorry to have taken so long to get back to you both on this (vacations and life), but I'd like to investigate this use case with you. I have been interested in this problem for a while now, and in finding alternatives to the pipeline graph for large pipelines.
The pipeline graph is liked by a lot of our users and I do think it has some benefits. However, although I cannot comment on all users with large pipelines, I can at least say that we hear about more friction from them. The navigation of the pipeline graph, no matter how refined, cannot accommodate large pipelines and ends up causing more friction than intended. The other problem is performance: showing the dependencies in a view is very costly, especially in larger graphs, which is ironic given that they are also the ones least likely to benefit from this view.
Having a list/table format for the graph would eliminate the restriction of 100 because we could do progressive loading.
I would have questions for both of you, if you are okay with that:
Would you want to completely replace the pipeline graph with a more "boring" view with a list/table of all jobs or have access to both next to each other for different purposes (like in a different tab)?
Would adding a search field (like we have in `/jobs`) be sufficient for this use case?
What would you like to be able to filter/search by? Job name, status, stage? More than that?
Would you need to see a list of dependencies per job even in that format? Like a needs column with a list of all the jobs it needs and their status?
Would you prefer to see this as a table where each column is a stage?
Large GitLab Ultimate customer is interested in this feature:
Feedback: I have a pipeline where I want to create a dependency on approximate 75 jobs as this will prevent a user from accidentally running jobs in the wrong order, but I currently receive an error due to the limit.
Mark Nuzzo changed title from Rollout limits for directed acyclic graph on gitlab.com to Backend: Rollout limits for directed acyclic graph on gitlab.com
The problem is the data structure and the performance of the computation. If there are 100 jobs, each with 100 needs, it will generate 10k entries for each pipeline, as the needs are not normalized.
We don't really need a lot to significantly improve this and support even 1000 needs if needed. We would need to introduce a way to normalize needs processing: all identical specifications should be re-used, and each unique spec should be calculated only once.
It could be as simple as introducing ci_builds.needs_spec_id and storing all unique needs in a separate table (it could even be jsonb). Then we would calculate based on needs_spec_id.
Yes, we would add this to ci_builds, as it is meant to join the other table, and ci_builds_metadata is not a good place for such information.
Maybe exploring #219212 would be a way for users to define virtual groups of needs (even large groups with more than 50 jobs), and for us to take advantage of normalization by storing and processing groups instead of single job-to-job links. The idea is not very different from needs_spec_id above. The main difference is that the grouping is explicit and we could even present it visually.
Also, we could eventually treat stages as groups of needs, where a job needs some groups (stages).
It is the same problem, but inverted. You need to allow depending on many groups, so you will not really have a single spec_id on the build side as mentioned, but rather many group_ids.
@furkanayhan Ci::CreatePipelineService does not run Ci::ProcessPipelineService directly; it does that async, so I'm assuming Ci::InitialPipelineProcessWorker hasn't run.
Are there any pending_builds given that the pipeline hasn't been processed yet? I'm assuming all the jobs are created initially.
Are we benchmarking the pipeline processing going through test1 and test2 job groups? Or only the build jobs group?
It would be interesting to see the effect of having 10 and 100 needs per job during pipeline creation, how it affects the Chain::Create step.
@fabiopitino It may be a bit hidden in my comment so I'll ask again: have you seen the "Click to expand" section? That's where the code I use for the benchmark is.
A way to denormalize/group needs together could be:
introduce a new table ci_job_needs (as we are already using needs for bridge jobs and we could eventually extend to GenericCommitStatus)
The table would have id (bigint), project_id (bigint), needs (jsonb) and checksum (varchar) - with (project_id, checksum) being indexed.
For each job we normalize the needs specifications in their extended format: [ { job: "build1", artifacts: true }, ... ]
We calculate the SHA256 checksum of the specifications
We look into ci_job_needs table for a matching project_id, checksum - if found we use the id otherwise we insert a new record and take the id
We use the id above as needs_spec_id.
This approach has the advantage of persisting a lot less data by deduplicating needs across the same jobs over time. The disadvantage is that it may be less performant to calculate checksums, compare and reuse needs specs for possibly thousands of jobs.
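To make the deduplication concrete, an illustrative (entirely made-up) example of what the data could look like in this model:

```yaml
# Illustrative data only -- not a real schema or dump.
ci_job_needs:
  - id: 42
    project_id: 123
    checksum: "<sha256 of the normalized needs spec>"
    needs:
      - { job: build1, artifacts: true }
      - { job: build2, artifacts: true }

ci_builds:
  - { id: 1001, name: test1, needs_spec_id: 42 }  # both jobs reuse spec 42
  - { id: 1002, name: test2, needs_spec_id: 42 }
```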
Another way could be:
introduce a new table ci_job_needs (as we are already using needs for bridge jobs and we could eventually extend to GenericCommitStatus)
The table would have job_id (bigint), project_id (bigint), partition_id (bigint), needs (jsonb).
For each job we store the needs as JSON in a single record.
This approach has the advantage of being simple, the needs data is readily available with a single query and bulk insert of this data may be easier to do. The disadvantage is that we may still persist duplicate data across jobs, although much fewer records than today's normalized approach.
Hi all, is there an ETA for increasing this limit? Seems like some investigations were happening 3 months ago but there have been no follow-up updates.
@dhershkovitch do you have an update on this effort? I see that #397672 (closed) got closed as a duplicate of this issue but it's hard to tell what the progress is (it looks like the last comments are 4 months old).