Add trie-indexed pattern matching for CODEOWNERS

What does this MR do and why?

Contributes to #594073 (closed)

Problem

Gitlab::CodeOwners::File#entries_for_path uses a brute-force O(P × S × E) algorithm, calling File.fnmatch? for every pattern in every section for each changed file path. For GitLab's own CODEOWNERS (2,127 patterns across 82 sections), a merge request touching 500 files triggers over 1 million fnmatch calls, taking ~500ms.

Solution

Introduce Gitlab::CodeOwners::PatternIndex, which builds a per-section directory-trie index from parsed CODEOWNERS data. 97% of patterns are anchored directory/file prefixes that can be resolved via O(depth) trie walks instead of linear scans. Only the ~3% of patterns with glob characters in their prefix fall back to fnmatch.

Benchmarked against the real GitLab CODEOWNERS file and 99K repository paths: fnmatch calls reduced by 96.8%, wall-clock speedup of 3.4× (698ms → 208ms for 500 paths). One-time index build cost is ~4ms, amortized after ~12 path lookups.

How the PatternIndex optimization works

Imagine the CODEOWNERS file with 2,000+ rules like:

/app/models/      @backend-team
/app/views/       @frontend-team
/ee/lib/gitlab/   @ee-team
*.rb              @ruby-owner

Before (brute force): For every changed file in your MR, we open the CODEOWNERS and check every single rule to see if it matches. Change 500 files? That's 500 × 2,000 = 1,000,000 checks.

After (trie index): We build a lookup tree from the folder structure of the rules, once:

(root)
 ├── app/
 │    ├── models/   → @backend-team
 │    └── views/    → @frontend-team
 └── ee/
      └── lib/
           └── gitlab/ → @ee-team

Now to check app/models/user.rb, we just walk rootappmodels (2 lookups) and immediately find the matching rule. We never even look at the 1,998 other rules.

The ~3% of rules that use wildcards (like *.rb) can't be organized into folders, so those still get checked the old way - but that's 63 checks instead of 2,000.

Result: 97% fewer checks, 3.4× faster.

Benchmark

Results against the real .gitlab/CODEOWNERS (82 sections, 2,127 patterns) and real repository paths (99,491 files):

Paths Brute-force (per path) PatternIndex (per path) Speedup
10 2.566ms 1.409ms 1.8×
50 0.725ms 0.387ms 1.9×
200 0.520ms 0.168ms 3.1×
500 0.472ms 0.139ms 3.4×

Correctness verified for all path counts — both approaches produce identical results.

Benchmark script and full output

Save as tmp/benchmark_codeowners_real.rb and run with bundle exec ruby tmp/benchmark_codeowners_real.rb from the project root:

#!/usr/bin/env ruby
# frozen_string_literal: true

require_relative '../config/environment'
require 'benchmark'

puts "=" * 70
puts "CODEOWNERS Pattern Matching Benchmark (real classes)"
puts "Brute-force vs PatternIndex"
puts "=" * 70
puts

codeowners_content = File.read(Rails.root.join('.gitlab', 'CODEOWNERS'))

BenchmarkBlob = Struct.new(:data, :path, :repository) do
  def binary?
    false
  end
end

def make_file(content)
  blob = BenchmarkBlob.new(content, 'CODEOWNERS', nil)
  Gitlab::CodeOwners::File.new(blob)
end

file = make_file(codeowners_content)
total_sections = file.parsed_data.size
total_patterns = file.parsed_data.values.sum { |s| s.size }
puts "Sections: #{total_sections}"
puts "Total pattern entries: #{total_patterns}"
puts

paths_all = `git ls-files --full-name`.lines.map(&:chomp).reject(&:empty?)
puts "Total files in repo: #{paths_all.size}"

srand(42)

[10, 50, 200, 500].each do |path_count|
  sampled = paths_all.sample(path_count)

  puts
  puts "-" * 70
  puts "#{path_count} paths x #{total_sections} sections x #{total_patterns} patterns"
  puts "-" * 70

  file_check = make_file(codeowners_content)
  mismatches = 0
  sampled.each do |p|
    bf_count = file_check.send(:entries_for_path_brute_force, p).size
    pi_count = file_check.send(:pattern_index).entries_for_path(p).size
    if bf_count != pi_count
      mismatches += 1
      puts "  MISMATCH: #{p} -- bf=#{bf_count}, pi=#{pi_count}" if mismatches <= 3
    end
  end
  puts mismatches > 0 ? "Correctness: FAILED (#{mismatches} mismatches)" : "Correctness: PASSED"

  iterations = 3

  Benchmark.bm(22) do |x|
    x.report("brute-force") do
      Feature.disable(:optimized_codeowners_pattern_matching)
      iterations.times do
        f = make_file(codeowners_content)
        sampled.each { |p| f.entries_for_path(p) }
      end
    end

    x.report("pattern-index (trie)") do
      Feature.enable(:optimized_codeowners_pattern_matching)
      iterations.times do
        f = make_file(codeowners_content)
        sampled.each { |p| f.entries_for_path(p) }
      end
    end
  end

  Feature.disable(:optimized_codeowners_pattern_matching)
  bf_time = Benchmark.measure do
    iterations.times do
      f = make_file(codeowners_content)
      sampled.each { |p| f.entries_for_path(p) }
    end
  end

  Feature.enable(:optimized_codeowners_pattern_matching)
  pi_time = Benchmark.measure do
    iterations.times do
      f = make_file(codeowners_content)
      sampled.each { |p| f.entries_for_path(p) }
    end
  end

  bf_per_path = (bf_time.real / (iterations * path_count) * 1000).round(3)
  pi_per_path = (pi_time.real / (iterations * path_count) * 1000).round(3)
  speedup = (bf_time.real / pi_time.real).round(1)

  puts
  puts "Per-path avg: brute-force=#{bf_per_path}ms, pattern-index=#{pi_per_path}ms"
  puts "Speedup: #{speedup}x"
end

Feature.disable(:optimized_codeowners_pattern_matching)

puts
puts "=" * 70
puts "Done"
puts "=" * 70

Full output:

======================================================================
CODEOWNERS Pattern Matching Benchmark (real classes)
Brute-force vs PatternIndex
======================================================================

Sections: 82
Total pattern entries: 2127

Total files in repo: 99491

----------------------------------------------------------------------
10 paths x 82 sections x 2127 patterns
----------------------------------------------------------------------
Correctness: PASSED
                             user     system      total        real
brute-force              0.098039   0.026772   0.124811 (  0.125673)
pattern-index (trie)     0.047715   0.009455   0.057170 (  0.056699)

Per-path avg: brute-force=2.566ms, pattern-index=1.409ms
Speedup: 1.8x

----------------------------------------------------------------------
50 paths x 82 sections x 2127 patterns
----------------------------------------------------------------------
Correctness: PASSED
                             user     system      total        real
brute-force              0.110403   0.008118   0.118521 (  0.118273)
pattern-index (trie)     0.054229   0.001329   0.055558 (  0.055411)

Per-path avg: brute-force=0.725ms, pattern-index=0.387ms
Speedup: 1.9x

----------------------------------------------------------------------
200 paths x 82 sections x 2127 patterns
----------------------------------------------------------------------
Correctness: PASSED
                             user     system      total        real
brute-force              0.298528   0.001745   0.300273 (  0.299960)
pattern-index (trie)     0.101122   0.006638   0.107760 (  0.107573)

Per-path avg: brute-force=0.52ms, pattern-index=0.168ms
Speedup: 3.1x

----------------------------------------------------------------------
500 paths x 82 sections x 2127 patterns
----------------------------------------------------------------------
Correctness: PASSED
                             user     system      total        real
brute-force              0.693829   0.004603   0.698432 (  0.698119)
pattern-index (trie)     0.203061   0.005080   0.208141 (  0.207814)

Per-path avg: brute-force=0.472ms, pattern-index=0.139ms
Speedup: 3.4x

======================================================================
Done
======================================================================

References

  • Feature flag: optimized_codeowners_pattern_matching (gitlab_com_derisk, disabled by default)
  • Feature issue: #594073 (closed)
  • Rollout issue: #601529

Screenshots or screen recordings

Not applicable (backend-only change, no UI).

How to set up and validate locally

  1. Enable the feature flag:
    Feature.enable(:optimized_codeowners_pattern_matching)
  2. Open any merge request with CODEOWNERS approval rules
  3. Verify that the approval rules resolve correctly
  4. Run the benchmark script above to confirm performance improvement

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Vasilii Iakliushin

Merge request reports

Loading