Rules are expensive to parse and evaluate, and even more expensive for parallel jobs, where we expand the parallel definition into individual jobs. The current execution path is:
Process the YAML files into a big Hash structure
Process the Hash in the normalizer class to expand parallel job definitions into the actual jobs
Evaluate the rules for each job to determine if it should be created or not.
Proposal
Because parallel jobs share the same rules, we should be able to evaluate them once, cache the result, and share it across all the jobs of the pipeline that's being created.
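A minimal, self-contained sketch of the idea in Ruby; every name below is illustrative rather than taken from the real classes:

```ruby
# Toy illustration only: evaluate the rules once per unique rules definition
# and reuse the result for every job expanded from the same `parallel:` entry.
rules = [{ 'if' => '$CI_COMMIT_BRANCH == "main"' }]

# `parallel: 3` expands into three seeds that share the same rules definition.
seeds = (1..3).map { |i| { name: "rspec #{i}/3", rules: rules } }

evaluations = 0
cache = Hash.new do |memo, key|
  evaluations += 1
  memo[key] = true # stand-in for the real (expensive) rules evaluation
end

seeds.each { |seed| seed[:included] = cache[seed[:rules]] }

puts evaluations # => 1 instead of 3
```

The open question is what the cache key has to include, which is what the risk below is about.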
Risks
If someone uses the variables exposed by the parallel config (like CI_NODE_INDEX) in their rules, the cached result could be wrong for some of the jobs, but why would anyone do that?
@fabiopitino @furkanayhan I think this could be another low-hanging fruit to improve pipeline creation speed for gitlab-org/gitlab since we use lots of parallel jobs.
@mbobin It depends on how complex it will be to fix this for parallel jobs. I see that it generally makes sense, but technically speaking the set of variables for each parallel job is different and could lead to different rules results.
This is a forced example and I'm not sure how much sense it makes, but it's only to illustrate the point:
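Sketched in Ruby rather than CI YAML, assuming a hypothetical rule that reads CI_NODE_INDEX:

```ruby
# Hypothetical rule: only the first parallel node should run the job.
rule = ->(variables) { variables['CI_NODE_INDEX'] == '1' }

# Each node of a `parallel: 3` job gets its own CI_NODE_INDEX value.
node_variables = (1..3).map do |index|
  { 'CI_NODE_TOTAL' => '3', 'CI_NODE_INDEX' => index.to_s }
end

results = node_variables.map { |variables| rule.call(variables) }
puts results.inspect # => [true, false, false]

# Evaluating the rule once for the first node and sharing the cached result
# would wrongly include nodes 2 and 3.
```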
We could hash each part separately so that we can distinguish various scenarios:
the rule is shared across very different jobs and references variables from the pipeline context.
the job definition is the same across pipelines and only the project context is used.
But this would also require us to understand what context is used at each level of caching (see the sketch after this list):
is the rule using only project-level context? E.g. if: $GITLAB_FEATURES =~ merge_trains.
is the rule using only pipeline-level context? E.g. if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH.
is the rule using job-level context?
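A rough sketch of how the cache key could be layered, assuming we can detect which context a rule references; everything here is hypothetical:

```ruby
# Hypothetical sketch: build the cache key only from the context levels the
# rule actually references, so a project-level rule can be shared more widely
# than a pipeline- or job-level one.
require 'digest'

def rule_cache_key(rule_config, project_context:, pipeline_context:, job_context:)
  referenced = rule_config.to_s.scan(/\$\w+/)

  parts = [rule_config]
  [project_context, pipeline_context, job_context].each do |context|
    parts << context if referenced.any? { |variable| context.key?(variable) }
  end

  Digest::SHA256.hexdigest(parts.inspect)
end

key = rule_cache_key(
  { 'if' => '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH' },
  project_context:  { '$GITLAB_FEATURES' => 'merge_trains' },
  pipeline_context: { '$CI_COMMIT_BRANCH' => 'main', '$CI_DEFAULT_BRANCH' => 'main' },
  job_context:      { '$CI_NODE_INDEX' => '1' }
)
puts key # the key depends only on the rule and the pipeline context here
```

The hard part is answering the three questions above reliably, since variables can reference other variables.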
Challenges:
Changed files (with rules:changes) affect the evaluation of rules and would be hard to cache. We may give up on caching the entry if rules:changes is present.
Is this planned only for parallel or also for parallel:matrix? If parallel:matrix is also affected, this won't be a viable solution because there are quite a few valid use cases for using the matrix variables in the rules.
@mbobin For example, when you have a project that contains multiple individual projects that follow the same structure and CI needs, such as a project containing Helm charts.
As another use case, I have two jobs for repository mirroring. Most projects have bidirectional mirroring, but for some I only want one-way mirroring. Instead of creating a new set of jobs for the one-way mirrors, I can just use rules to re-use the existing jobs.
I'm having a look at Ci::Config::Entry::Rules::Rule and thinking there might be a clever way to implement a generalized evaluated-rule caching strategy here. If we cooked up a ::Gitlab::Config::Entry::Cacheable module, it could look something like:
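Very roughly, and purely as a guess at the interface (none of this exists in the codebase today):

```ruby
# Hypothetical sketch of a ::Gitlab::Config::Entry::Cacheable mixin; it only
# illustrates the shape of the idea.
module Gitlab
  module Config
    module Entry
      module Cacheable
        # Memoize an expensive computation in a cache scoped to one pipeline
        # creation, keyed by the entry class, its config, and whatever context
        # the caller says the result depends on.
        def cached_result(cache, config, context_key)
          key = [self.class.name, config, context_key]
          cache.fetch(key) { cache[key] = yield }
        end
      end
    end
  end
end

# Imagined usage inside Ci::Config::Entry::Rules::Rule:
#
#   include ::Gitlab::Config::Entry::Cacheable
#
#   def cached_match?(cache, variables_key)
#     cached_result(cache, config, variables_key) { expensive_evaluation }
#   end
```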
@furkanayhan @mbobin @drew hello all! I'm pinging you here because you've been most active in this issue. group::pipeline authoring has this issue scheduled for %17.6, but I'm worried that it's not ready to be worked on because of the possibility of breaking CI_NODE_INDEX (and other variables?). I see three scenarios ahead of us:
1. We try to implement this issue without breaking those variables, with the caveat that it may not be possible
2. We deprecate those variables, remove them in %18.0, and schedule this issue for %18.1
3. We close this issue as unworkable because it will break the matrix variables
I'm leaning towards 3 (closing this issue as unworkable) because there are other improvements we can make to pipeline creation performance, so I'd rather not spend time on this issue if it might not be possible. But I'm not sure. WDYT?
Eh, hard to say. @avielle Do you think we should leave this on ice until we get some kind of better measurement out of #469374?
It still sounds like a good idea in theory, but I haven't looked at this or thought about it since before the other caching improvements we've already made. So I have no idea what the current performance or potential benefit is.
At the very least, I would defer this and not work on it now.
@marknuzzo Like #350057 (closed), this will speed up both the lint and pipeline creation endpoints, so I'm applying candidate17.3 to it. I'm also applying weight 4.
!150805 (merged): we added a cache for the individual pattern globs of each exists clause.
!152843 (merged): we added a cache for each whole changes clause.
We know the rules are slow from the pipeline_seed_build_inclusion metric, and as far as we know, the slowness comes from the exists and changes clauses. So maybe our next step can be adding a cache for each whole exists clause, like we did for changes.
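A rough sketch of what that could look like, with illustrative names rather than the actual GitLab classes:

```ruby
# Illustrative only: cache the result of a whole `exists` clause, keyed by its
# pattern list, so every job repeating the same clause reuses one evaluation
# per pipeline creation.
class WholeExistsClauseCache
  def initialize(repository_paths)
    @repository_paths = repository_paths
    @results = {}
  end

  def satisfied?(patterns)
    @results.fetch(patterns.sort) do |key|
      @results[key] = patterns.any? do |pattern|
        @repository_paths.any? do |path|
          File.fnmatch?(pattern, path, File::FNM_PATHNAME | File::FNM_EXTGLOB)
        end
      end
    end
  end
end

cache = WholeExistsClauseCache.new(%w[Gemfile app/models/user.rb spec/spec_helper.rb])
cache.satisfied?(['spec/**/*.rb', 'Dockerfile']) # evaluated against the paths
cache.satisfied?(['Dockerfile', 'spec/**/*.rb']) # served from the cache
```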
However, we must ensure that the slowness comes only from exists and changes. So maybe we can add additional instrument blocks.
Besides, we added those caches only around File.fnmatch?; maybe the slowness is from ExpandVariables.expand_existing!
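For instance, something along these lines: a self-contained sketch that only illustrates timing the clauses separately, not the real instrumentation API, and the block names are made up:

```ruby
# Time the exists check, the changes check, and variable expansion separately,
# so we can see which part dominates instead of one aggregate number per job.
require 'benchmark'

def timed(durations, name)
  result = nil
  durations[name] += Benchmark.realtime { result = yield }
  result
end

durations = Hash.new(0.0)

# The sleeps stand in for the real exists/changes/ExpandVariables work.
timed(durations, :exists_check)       { sleep(0.001) }
timed(durations, :changes_check)      { sleep(0.001) }
timed(durations, :variable_expansion) { sleep(0.001) }

durations.each { |name, seconds| puts format('%-20s %.4fs', name, seconds) }
```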
This sounds like a good next step. WDYT about repurposing this issue to add the additional instrumentation?
Also, after seeing the improvements we've already made in this area and considering the slowness could come from the variables - which we already have issues to address - I'm not sure this issue is a priority. I'm going to remove the candidate17.3 label for now
WDYT about repurposing this issue to add the additional instrumentation?
@avielle I think we should create another issue to implement additional instrumentation. BTW, I also created #468649 to remove redundant instrumentation. cc @lauraXD
@furkanayhan you said in #450687 (comment 1966507937) that we need more instrumentation before we know if the improvements in this issue will be valuable. Or did I misunderstand?
@avielle Yeah, I might have misjudged over there...
#469374 is about internal instrumentation of rules: what's slow? exists, changes, etc.
In this issue, we try to reduce the duplication of calls. Honestly, we could do #469374 first to do this in perfect order, but it's not a hard blocker. That's what I mean actually :)
In the meantime, I also think adding more instrumentation may impact performance, especially calling it for every rule of every job.
@furkanayhan @avielle Is that how we would add instrumentation though? I feel like some performance sampling, even on internal projects only, would be enough to give us a picture of where the time is being spent.