Backend: Increase MAX_PATTERN_COMPARISON for rules glob matcher from 10k->50k

Problem to solve

Within devopssecure, we are running into the upper limit on pattern comparisons for rules.exists and rules.changes (MAX_PATTERN_COMPARISON), which is currently set to 10,000.

In SAST, this causes customer pipelines to run more analyzer jobs than necessary. In most cases this shouldn't cause functional problems, but it is undesirable: at best, extra images are pulled onto runners only for the jobs to do nothing, which is wasteful and confusing.

This happens because security scanning jobs use rules to determine which scans to run in which projects. For example, to run our previous JavaScript scanner, we used the following (simplified) job definition, which targets only files with JavaScript:

eslint-sast:
  extends: .sast-analyzer
  rules:
    - if: $CI_COMMIT_BRANCH
      exists:
        - '**/*.html'
        - '**/*.js'

Unfortunately, the current limitation means that any project with over 10k files will always trigger this job, whether or not it contains matching files. Even our own gitlab-org/gitlab project has about 35k files with a shallow clone:

❯ cd gitlab
❯ find . -type f | wc -l
   34494
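
To make the failure mode concrete, here is a minimal sketch of a glob matcher with a comparison budget. It is not the GitLab implementation (which is written in Ruby). The function name exists_rule_matches, the limit parameter, and the use of Python's fnmatch are assumptions made for illustration. The one behavior taken from this issue is that exceeding the limit is treated as a match, so the job is always added.

```python
from fnmatch import fnmatchcase

# Limit discussed in this issue; the name mirrors the constant in the title.
MAX_PATTERN_COMPARISON = 10_000


def exists_rule_matches(globs, paths, limit=MAX_PATTERN_COMPARISON):
    """Hypothetical sketch: True if any path matches any glob,
    or if the comparison budget is used up before an answer is found."""
    comparisons = 0
    for glob in globs:
        for path in paths:
            comparisons += 1
            if comparisons > limit:
                # Budget exhausted: give up and assume a match,
                # which is what causes the falsely triggered jobs.
                return True
            if fnmatchcase(path, glob):
                return True
    return False


# A repository with 8 Ruby files and no JavaScript at all.
small_repo = [f"app/models/file_{i}.rb" for i in range(8)]

print(exists_rule_matches(["**/*.js"], small_repo, limit=100))  # False
print(exists_rule_matches(["**/*.js"], small_repo, limit=5))    # True
```

With a budget larger than the number of comparisons needed, the rule correctly evaluates to false. With a budget smaller than that, the same repository triggers the job even though nothing matches.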

From a conversation with the initial implementor, the original maximum was chosen arbitrarily: #220983 (comment 376130220). I propose we increase this limit to better accommodate real-world project sizes.

Workaround

For SAST, the workaround is to manually disable the analyzers that you don't want to run. You can set the corresponding CI/CD variable in a project or at a higher level (for instance, at the group level, in a compliance pipeline, or in a scan execution policy).

User experience goal

More accurate rules.exists matches against repository file contents, and fewer falsely triggered jobs.

Proposal

Increase MAX_PATTERN_COMPARISON to 50k. 50k is also pretty arbitrary but covers the gitlab-org/gitlab project as a baseline.
Log when 50k is exceeded.
Additionally, log how many unique caches take place per pipeline and, based on that data, lower the maximum accordingly.

Documentation

Update the rules.exists and rules.changes usage docs to mention the new limit.

Availability & Testing

There is some risk in increasing this limit, since the traversal happens outside of a pipeline context to determine whether to spawn a given build. While path traversal is not a particularly expensive operation, we should run some benchmarks to measure the impact of the increased maximum.

What does success look like, and how can we measure that?

More accurate rules.exists matches against repository file contents, and fewer falsely triggered jobs for large projects.

Is this a cross-stage feature?

devopsverify

Edited Jun 10, 2024 by Furkan Ayhan - OOO until 29 July