advanced search ReDOS in highlight for code results
Summary
Reported in customer ticket: https://gitlab.com/gitlab-com/enablement-sub-department/section-enable-request-for-help/-/issues/83
Very long lines with matches in code results can cause ReDOS when search results are parsed and highlighting is calculated
Steps to reproduce
I used https://json-generator.com/# to generate a 264 MB json file and removed all newlines from it so the JSON is on one line.
Example json file: xl-json-example.json
- create a public project
- upload the json file
- run this bash script, replacing
GROUP_ID
,PROJECT_ID
, andSEARCH_TERM
with appropriate values - watch the CPU, the puma process CPU quickly climbs with a low rate of traffic
### !/bin/bash
DURATION=60
### Start time
START_TIME=$(date +%s)
### Loop until the duration is reached
while [ $(($(date +%s) - START_TIME)) -lt $DURATION ]; do
curl 'http://gdk.test:3000/search?group_id=GROUP_ID&project_id=PROJECT_ID&scope=blobs&search=SEARCH_TERM' &
# Wait for 1 second
sleep 1
done
What is the current bug behavior?
Very long lines with matches can cause catastrophic backtracking for finding the highlights from Elasticsearch
What is the expected correct behavior?
The regular expression should be rewritten to avoid catastrophic backtracking
Relevant logs and/or screenshots
The regular expression is located in ee/lib/gitlab/elastic/search_results.rb
matched_lines_count = highlight_content.scan(/#{::Elastic::Latest::GitClassProxy::HIGHLIGHT_START_TAG}(.*?)\R/o).size