Improve CI rules:exists pattern matching performance with cache
What does this MR do and why?
In rules:exists
, if a file path is a pattern (including *
), the path
pattern is iterated through every single file in the repository via
File.fnmatch?
. This is an expensive operation so it is limited to
MAX_PATTERN_COMPARISONS
(10k for now) per pattern.
We want to increase this limit (#227632 (closed)) but we need to improve (#351593 (closed)) the performance problem first.
Here, we are adding a request-store caching per glob. With this, a glob is iterated only once. We can accept the fact that files do not change during a request or a worker processing.
This is behind a feature flag ci_rules_exists_pattern_matches_cache
(#457914 (closed)).
Related to #351593 (closed)
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
The project has these files;
config/app.rb
hello1.yml
hello2.yml
and this CI config;
job1:
script: echo Hello, World!
rules:
- exists:
- 'docs/*.md' # does not match
- 'config/*.rb' # matches
job2:
script: echo Hello, World!
rules:
- exists:
- 'docs/*.md' # does not match
- '**/app.rb' # matches
job3:
script: echo Hello, World!
rules:
- exists:
- 'config/*.yml' # does not match
- '**/app.rb' # matches
job4:
script: echo Hello, World!
rules:
- exists:
- '**/app.rb' # matches
When the feature flag is enabled:
-
File.fnmatch?
runs fordocs/*.md
3 times (the number of project files) because it iterates all files once. -
File.fnmatch?
runs forconfig/*.rb
once because it iterates once and finds the file. -
File.fnmatch?
runs forconfig/*.yml
3 times (the number of project files) because it iterates all files once. -
File.fnmatch?
runs for**/app.rb
once because it iterates once and finds the file.
When the feature flag is not enabled:
-
File.fnmatch?
runs fordocs/*.md
6 times (2 x the number of project files) because it iterates all files twice. -
File.fnmatch?
runs forconfig/*.rb
once because it iterates once and finds the file. -
File.fnmatch?
runs forconfig/*.yml
3 times (the number of project files) because it iterates all files once. -
File.fnmatch?
runs for**/app.rb
3 times because it iterates once and finds the file three times.
Benchmark
# GITALY_DISABLE_REQUEST_LIMITS=true rails console
require 'benchmark'
ActiveRecord::Base.logger = nil
@project = Project.find(4) # test project with ~16k files
@paths = @project.repository.ls_files('master')
@glob = 'tests/test-*.c' # matches almost at the end of the list
def run_without_cache
@paths.any? do |path|
File.fnmatch?(@glob, path, File::FNM_PATHNAME | File::FNM_DOTMATCH | File::FNM_EXTGLOB)
end
end
def run_with_cache
Gitlab::SafeRequestStore.fetch("test123123_#{@glob}") do
@paths.any? do |path|
File.fnmatch?(@glob, path, File::FNM_PATHNAME | File::FNM_DOTMATCH | File::FNM_EXTGLOB)
end
end
end
::Gitlab::SafeRequestStore.ensure_request_store do
times_to_run = 1000
Benchmark.bm do |x|
x.report('without cache') { times_to_run.times { run_without_cache } }
x.report('with cache') { times_to_run.times { run_with_cache } }
end
end; nil
user system total real
without cache 5.867844 0.011283 5.879127 ( 5.897253)
with cache 0.010458 0.000036 0.010494 ( 0.010554)