Backend: Increase MAX_PATTERN_COMPARISON for rules glob matcher from 10k->50k
Problem to solve
Within devopssecure we are running into the upper limit for rules.exists and rules.changes, MAX_PATTERN_COMPARISON, which is currently set to 10,000.
In SAST, this causes customer pipelines to run more analyzer jobs than are necessary. This shouldn't cause functional problems in most cases, but it is undesirable: at best, extra images are pulled down onto runners only to do nothing, which is wasteful and confusing.
This is because security scanning jobs use rules to determine which scans to run in which projects. For example, to run our previous JavaScript scanner we used the following (simplified) job definition, which targets only projects that contain JavaScript files:
eslint-sast:
  extends: .sast-analyzer
  rules:
    - if: $CI_COMMIT_BRANCH
      exists:
        - '**/*.html'
        - '**/*.js'
Unfortunately, the current limitation means that any project with over 10k files will always trigger this job. Even our own gitlab-org/gitlab project has about 35k files with a shallow clone:
❯ cd gitlab
❯ find . -type f | wc -l
34494
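To illustrate why large projects always trigger the job, here is a minimal Python sketch of a capped glob matcher. It is not the actual backend code: the function name `exists_matches` is hypothetical, Python's `fnmatch` only approximates the real glob semantics, and counting the budget per path/glob comparison is an assumption. The one behavior taken from this issue is that hitting the limit is treated as a match.

```python
from fnmatch import fnmatch

MAX_PATTERN_COMPARISON = 10_000  # current limit named in this issue


def exists_matches(paths, globs, limit=MAX_PATTERN_COMPARISON):
    """Return True if any path matches any glob, or if the comparison budget runs out.

    Hypothetical helper for illustration only.
    """
    comparisons = 0
    for path in paths:
        for glob in globs:
            comparisons += 1
            if comparisons > limit:
                # Budget exhausted: report a match, so the job runs anyway.
                return True
            if fnmatch(path, glob):
                return True
    return False


globs = ["**/*.html", "**/*.js"]
# A project with no JavaScript or HTML, but enough files to exhaust the budget.
large_project = [f"src/file_{i}.py" for i in range(6_000)]

print(exists_matches(large_project, globs))                # limit hit, job triggers
print(exists_matches(large_project, globs, limit=50_000))  # full scan, no match
```

With the higher limit the same project is scanned completely and the job is correctly skipped.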
From conversation with the initial implementor, the original maximum was chosen arbitrarily: #220983 (comment 376130220). I'd propose we increase this limit to better accommodate real-world project sizes.
Workaround
For SAST, the workaround is to manually disable analyzers that you don't want to run. You can set this variable in a project or at a higher level (for instance, at the group level, in a compliance pipeline, or in a scan execution policy).
User experience goal
More accurate rules.exists matches against repository file contents, and fewer falsely triggered jobs.
Proposal
- Increase MAX_PATTERN_COMPARISON to 50k. 50k is also pretty arbitrary but covers the gitlab-org/gitlab project as a baseline.
- Log when 50k is exceeded.
- Additionally, log how many unique caches per pipeline take place and, based on the logs, lower the max according to that data.
Documentation
Update rules.exists and rules.changes usage docs to mention the new limit.
Availability & Testing
There is risk in increasing this limit since the traversal happens outside of a pipeline context to determine whether to spawn a given build. While path traversal is not a particularly expensive operation, we should consider running some benchmarks to measure the impact of the increased maximum.
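As a starting point for such a benchmark, the following Python sketch times a full path/glob traversal at the current and proposed limits. It is only a rough stand-in: the paths are synthetic, `time_traversal` is a hypothetical helper, and Python's `fnmatch` is not the matcher the backend uses, so the absolute numbers will not carry over. It should only give a feel for how the cost scales with the number of comparisons.

```python
import time
from fnmatch import fnmatch


def time_traversal(num_paths, globs):
    """Compare every synthetic path against every glob.

    Returns (number of comparisons, elapsed seconds). Hypothetical helper.
    """
    paths = [f"dir_{i % 100}/file_{i}.rb" for i in range(num_paths)]
    comparisons = 0
    start = time.perf_counter()
    for path in paths:
        for glob in globs:
            comparisons += 1
            fnmatch(path, glob)
    return comparisons, time.perf_counter() - start


if __name__ == "__main__":
    for num_paths in (10_000, 50_000):
        count, seconds = time_traversal(num_paths, ["**/*.html", "**/*.js"])
        print(f"{num_paths} paths: {count} comparisons in {seconds * 1000:.1f} ms")
```

A real measurement would need to run against the actual matcher and a real repository tree, ideally gitlab-org/gitlab itself.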
What does success look like, and how can we measure that?
More accurate rules.exists matches against repository file contents, and fewer falsely triggered jobs for large projects.