Advanced search: ReDoS in highlighting for code results

Summary

Reported in customer ticket: https://gitlab.com/gitlab-com/enablement-sub-department/section-enable-request-for-help/-/issues/83

Very long lines that contain matches in code results can cause ReDoS when the search results are parsed and the highlighting is calculated.

Steps to reproduce

I used https://json-generator.com/# to generate a 264 MB JSON file and removed all newlines from it, so the entire JSON document is on a single line.

Example json file: xl-json-example.json
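If the attached file is not available, a comparable single-line file can be produced locally. The following is a minimal sketch, not the method used for the original file: the helper name `write_single_line_json` and the record fields are hypothetical, and the only property it aims to reproduce is "a large JSON document with no newlines".

```ruby
require "json"

# Hypothetical helper: writes a JSON array of synthetic records on a single
# line until at least target_bytes have been written. Returns the byte count.
def write_single_line_json(path, target_bytes)
  written = 0
  File.open(path, "w") do |file|
    written += file.write("[")
    index = 0
    while written < target_bytes
      # Field names are made up for illustration only.
      record = { "id" => index, "name" => "user#{index}", "tags" => %w[alpha beta gamma] }
      written += file.write(",") if index.positive?
      # JSON.generate emits compact output without newlines.
      written += file.write(JSON.generate(record))
      index += 1
    end
    written += file.write("]")
  end
  written
end
```

Calling `write_single_line_json("xl-json-example.json", 264 * 1024 * 1024)` would produce a file of roughly the size described above. SEARCH_TERM should then be a value that occurs in the generated records, for example `alpha`.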

  1. create a public project
  2. upload the json file
  3. run this bash script, replacing GROUP_ID, PROJECT_ID, and SEARCH_TERM with appropriate values
  4. watch the CPU: the puma process's CPU usage climbs quickly, even at a low rate of traffic

#!/bin/bash

DURATION=60

# Start time
START_TIME=$(date +%s)

# Loop until the duration is reached
while [ $(($(date +%s) - START_TIME)) -lt $DURATION ]; do
    # Send one search request in the background
    curl 'http://gdk.test:3000/search?group_id=GROUP_ID&project_id=PROJECT_ID&scope=blobs&search=SEARCH_TERM' &
    # Wait for 1 second
    sleep 1
done

What is the current bug behavior?

Very long lines that contain matches can cause catastrophic backtracking when the regular expression scans the highlighted content returned by Elasticsearch.

What is the expected correct behavior?

The regular expression should be rewritten to avoid catastrophic backtracking.

Relevant logs and/or screenshots

The regular expression is located in ee/lib/gitlab/elastic/search_results.rb:

        matched_lines_count = highlight_content.scan(/#{::Elastic::Latest::GitClassProxy::HIGHLIGHT_START_TAG}(.*?)\R/o).size

Possible fixes
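One possible direction, sketched below under stated assumptions, is to count matching lines with plain substring checks instead of a lazy `.*?` scan. With the quoted pattern, every occurrence of the start tag that is not followed by a line break forces the engine to walk to the end of the string before failing, so a very long single line with many highlights is rescanned repeatedly. In this sketch `HIGHLIGHT_START_TAG` is a hypothetical stand-in value (the real constant is not shown in this issue), both method names are made up, and line endings are assumed to be `\n` or `\r\n`. It is not necessarily the fix that should be adopted.

```ruby
# Hypothetical stand-in for ::Elastic::Latest::GitClassProxy::HIGHLIGHT_START_TAG.
HIGHLIGHT_START_TAG = "<<hl>>"

# Same shape as the pattern quoted above (the tag is escaped here for safety).
ORIGINAL_PATTERN = /#{Regexp.escape(HIGHLIGHT_START_TAG)}(.*?)\R/

# Counts highlighted lines the way the quoted code does.
def count_with_regex(content)
  content.scan(ORIGINAL_PATTERN).size
end

# Sketch of an alternative: visit each line once and count the
# newline-terminated lines that contain the start tag.
def count_with_include(content)
  content.each_line.count do |line|
    line.end_with?("\n") && line.include?(HIGHLIGHT_START_TAG)
  end
end

sample = "a#{HIGHLIGHT_START_TAG}b\nplain line\n#{HIGHLIGHT_START_TAG}c\n"
puts count_with_regex(sample)
puts count_with_include(sample)
```

Both methods skip a final line that has no trailing line break, which mirrors the `\R` requirement in the quoted pattern. Whether the counts agree for every input the search code sees (for example, content that uses a bare `\r` as a line separator) would need to be checked against the real data before relying on this.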