Improve detection of sensitive information (!14) · Merge requests · GitLab.com / GitLab Support Team / toolbox / harcleaner

Michael Trainor requested to merge regex-redact into main Feb 26, 2024

Problem

Previously, harcleaner would only process specific known elements of the JSON file. We needed to be able to redact sensitive information based on pattern matching and keywords.

Solution

Influenced by the following projects:

The solution makes heavy use of Go's reflect package. I don't pretend to understand Go reflection well enough, so this implementation might be clumsy.

The HAR's entries are now walked to ensure that each underlying struct or string is processed. This includes checking for the existence of keywords, and using regular expressions to search and replace known patterns such as tokens or credentials.
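To illustrate the walking described above, here is a minimal sketch of a reflection-based walk that applies a function to every string it can reach. The `Header` and `Entry` types, `walkStrings`, and `redactEntry` are hypothetical names made up for this example and are not harcleaner's actual types or functions; maps and interfaces are left out for brevity.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Header and Entry are hypothetical stand-ins for HAR types,
// not harcleaner's actual structs.
type Header struct {
	Name  string
	Value string
}

type Entry struct {
	URL     string
	Headers []Header
	Comment *string
}

// walkStrings recurses through pointers, structs, slices and arrays,
// replacing every settable string with the result of fn.
// Maps and interfaces are left out to keep the sketch short.
func walkStrings(v reflect.Value, fn func(string) string) {
	switch v.Kind() {
	case reflect.Ptr:
		if !v.IsNil() {
			walkStrings(v.Elem(), fn)
		}
	case reflect.Struct:
		for i := 0; i < v.NumField(); i++ {
			walkStrings(v.Field(i), fn)
		}
	case reflect.Slice, reflect.Array:
		for i := 0; i < v.Len(); i++ {
			walkStrings(v.Index(i), fn)
		}
	case reflect.String:
		if v.CanSet() { // unexported or non-addressable fields are skipped
			v.SetString(fn(v.String()))
		}
	}
}

// redactEntry takes a pointer so that the fields are addressable and settable.
func redactEntry(e *Entry, fn func(string) string) {
	walkStrings(reflect.ValueOf(e), fn)
}

func main() {
	note := "contains hunter2"
	e := Entry{
		URL:     "https://example.com/?password=hunter2",
		Headers: []Header{{Name: "X-Password", Value: "hunter2"}},
		Comment: &note,
	}
	redactEntry(&e, func(s string) string {
		return strings.ReplaceAll(s, "hunter2", "[REDACTED]")
	})
	fmt.Println(e.URL)
	fmt.Println(e.Headers[0].Value)
	fmt.Println(*e.Comment)
}
```

Passing a pointer to the top-level value matters here: `reflect` can only set fields that are addressable, so walking a copy would leave the original untouched.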

Specific elements of the HAR's entries are still handled similarly to how they were before, but these checks occur during the walking of the HAR entry's structs.

I removed the specific redaction of the QueryString and Headers, as these are now handled by keyword matching and redaction. Previously these elements were fully redacted out of caution; they are now only partially redacted, because some of the information in these fields is useful.

Cookies are still always redacted, as we don't think they will be useful for our analysis.

Regular Expressions

This introduces the following regular expressions to search/replace:

key=value where the key matches a list of keywords
JWT tokens
GitLab tokens
user:password in URLs

We can easily add more regular expressions as we learn more about what to redact.

When multiple regular expressions are applied to a string, the output of each pattern's match/replace becomes the input to the next regular expression. This ensures that a string matching several patterns has every match redacted.
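The chaining can be sketched roughly as follows. The patterns below are illustrative assumptions, not the ones in this merge request: the keyword list is invented, the JWT pattern assumes the usual `eyJ` prefix of a base64url-encoded JSON header, and the GitLab pattern only covers the `glpat-` personal access token prefix. The names `rule`, `rules` and `redact` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a compiled pattern with its replacement text.
type rule struct {
	re   *regexp.Regexp
	repl string
}

// rules is an illustrative list; the real patterns may differ.
var rules = []rule{
	// key=value where the key is one of a few example keywords
	{regexp.MustCompile(`(?i)\b(password|token|secret)=([^&\s;]+)`), "${1}=[REDACTED]"},
	// JWT: three base64url segments, the first starting with "eyJ"
	{regexp.MustCompile(`\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+`), "[REDACTED]"},
	// GitLab personal access token prefix
	{regexp.MustCompile(`\bglpat-[A-Za-z0-9_-]{20,}`), "[REDACTED]"},
	// user:password@ in a URL
	{regexp.MustCompile(`(://)[^/\s:@]+:[^/\s@]+@`), "${1}[REDACTED]@"},
}

// redact applies every rule in order; each rule sees the output of the previous one.
func redact(s string) string {
	for _, r := range rules {
		s = r.re.ReplaceAllString(s, r.repl)
	}
	return s
}

func main() {
	in := "https://alice:hunter2@example.com/api?token=glpat-abcdefghijklmnopqrst&page=2"
	fmt.Println(redact(in))
}
```

In this example the input matches both the key=value rule and the URL credentials rule, and because the rules run in sequence both parts end up redacted while `page=2` is preserved.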

Approach

I decided to perform the regular expression search/replace operations on each string field of each struct. This ensures we check every string for potentially sensitive information, and it also reduces the scope of each regex operation. Each string has multiple regex patterns applied to it, to search for and replace matches.

The number of regex operations will depend on:

The number of non-empty and non-fully redacted string fields in the HAR file
The number of regex patterns in the list

Other solutions seem to process the file for a given keyword only if that keyword exists in the file. Because we operate on structs rather than on a single string, checking whether a keyword exists would itself be expensive, so we simply run the regular expressions on each string anyway. Because the individual strings are small, each regex operation is also less expensive.

Optimisations introduced to avoid performance problems:

Keyword substitution and regex comparisons only occur for patterns that support keywords (72% reduction in regex comparisons)
Stop processing a string once a previous pattern search/replace has fully redacted it
Match all keywords in a single regex pattern, instead of using one pattern per keyword (87% reduction in the number of regex executions)
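As a rough sketch of the last two optimisations, the code below builds one alternation pattern from a keyword list and stops applying rules once a string is fully redacted. The names `keywordRule`, `defaultRules` and `redactString`, the `[REDACTED]` marker and the keyword list are assumptions made for this example, not taken from the merge request.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const redacted = "[REDACTED]" // assumed marker for a fully redacted string

type rule struct {
	re   *regexp.Regexp
	repl string
}

// keywordRule compiles a single pattern covering all keywords,
// instead of one pattern per keyword.
func keywordRule(keywords []string) rule {
	quoted := make([]string, len(keywords))
	for i, k := range keywords {
		quoted[i] = regexp.QuoteMeta(k)
	}
	pattern := `(?i)\b(` + strings.Join(quoted, "|") + `)=[^&\s;]+`
	return rule{regexp.MustCompile(pattern), "${1}=" + redacted}
}

// defaultRules returns an illustrative rule list: a token pattern
// followed by the combined keyword pattern.
func defaultRules() []rule {
	return []rule{
		{regexp.MustCompile(`\bglpat-[A-Za-z0-9_-]{20,}`), redacted},
		keywordRule([]string{"password", "token", "secret"}),
	}
}

// redactString applies the rules in order and reports how many regex
// executions were needed. It stops early when nothing is left to redact.
func redactString(s string, rules []rule) (string, int) {
	runs := 0
	for _, r := range rules {
		if s == "" || s == redacted {
			break
		}
		s = r.re.ReplaceAllString(s, r.repl)
		runs++
	}
	return s, runs
}

func main() {
	for _, in := range []string{
		"glpat-abcdefghijklmnopqrst",
		"a=1&Password=hunter2&x.y=2",
	} {
		out, runs := redactString(in, defaultRules())
		fmt.Println(out, runs)
	}
}
```

With three keywords, the combined pattern costs one regex execution per string where a per-keyword approach would cost three, and the early exit skips the remaining rules for strings that are already fully redacted.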

Edited Feb 29, 2024 by Michael Trainor
