Improve detection of sensitive information
## Problem
Previously, harcleaner would only process specific known elements of the JSON file. We needed to be able to redact sensitive information based on pattern matching and keywords.
## Solution
Influenced by the following projects:
The solution makes heavy use of Golang's `reflect` package. I don't pretend to understand Golang reflection sufficiently, so this implementation might be very clumsy.
The HAR's entries are now walked to ensure that each underlying struct or string is processed. This includes checking for the existence of keywords, and using regular expressions to search and replace known patterns such as tokens or credentials.
Specific elements of the HAR's entries are still handled similarly to how they were before, but these checks occur during the walking of the HAR entry's structs.
I removed the specific redaction of the QueryString and Headers, as these are now handled by keyword matching and redaction. These elements were previously fully redacted out of caution, but will now be partially redacted, as some of the information in these fields is useful.
Cookies are still always redacted, as we don't think they will be useful for our analysis.
## Regular Expressions
This introduces the following regular expressions to search/replace:
- `key=value`, where the `key` matches a list of keywords
- JWT tokens
- GitLab tokens
- `user:password` in URLs
We can easily add more regular expressions as we learn more about what to redact.
When multiple regular expressions are applied to a string, the result of each pattern match/replace is the input string to the next regular expression. This ensures that a string that matches multiple patterns will be redacted.
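A minimal sketch of the chaining described above. The patterns here are illustrative stand-ins (hypothetical, not the real harcleaner patterns): one for `key=value` pairs with sensitive keys, one for `user:password` in URLs.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns in the spirit of the ones described above.
var patterns = []*regexp.Regexp{
	// key=value where the key looks sensitive
	regexp.MustCompile(`(?i)(token|password|secret)=[^&\s]+`),
	// user:password embedded in a URL
	regexp.MustCompile(`://[^/\s:]+:[^@\s]+@`),
}

var replacements = []string{
	`${1}=[REDACTED]`,
	`://[REDACTED]@`,
}

// redact applies every pattern in order; the output of each
// replacement is the input to the next, so a string matching
// multiple patterns is fully redacted.
func redact(s string) string {
	for i, p := range patterns {
		s = p.ReplaceAllString(s, replacements[i])
	}
	return s
}

func main() {
	// This string matches both patterns, so both are redacted.
	fmt.Println(redact("https://bob:hunter2@host/cb?token=abc123"))
}
```

Because each pattern runs on the previous pattern's output, adding a new pattern to the list is enough to extend coverage; no coordination between patterns is needed.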
## Approach
I decided to perform the regular expression search/replace operations on each individual struct's string field. This ensures we're checking each string for potentially sensitive information, and also reduces the scope of each regex operation. Each string will have multiple regex patterns applied to it, to search and replace matches.
The number of regex matches will depend on:
- The number of non-empty and non-fully redacted string fields in the HAR file
- The number of regex patterns in the list
Other solutions seem to only process the file for a given keyword if that keyword exists in the file. As we're operating on a struct and not a single string, determining whether a keyword exists at all would already be expensive, so it's cheaper to simply perform the regular expression processing on each string anyway. As the strings are small, each individual regex operation is inexpensive.
Optimisations introduced to avoid performance problems:
- Keyword substitution and regex comparisons will only occur for patterns that support keywords (72% reduction in regex comparisons)
- Stop processing string if it's fully redacted already from previous pattern search/replace
- Compare all keywords in the same regex pattern, instead of a single regex per keyword. This resulted in an 87% reduction in the number of regex executions
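The last optimisation can be sketched as follows: rather than compiling and running one regex per keyword, the keywords are joined into a single alternation so each string is scanned once. The keyword list here is a hypothetical example:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical keyword list; the real list would be longer.
var keywords = []string{"token", "password", "secret", "authorization"}

// One combined pattern with all keywords as alternatives, instead of
// len(keywords) separate patterns each scanning the same string.
var combined = regexp.MustCompile(
	`(?i)(` + strings.Join(keywords, "|") + `)=[^&\s]+`)

func main() {
	in := "password=hunter2&lang=en&token=abc"
	// Both sensitive keys are caught in a single pass.
	fmt.Println(combined.ReplaceAllString(in, `${1}=[REDACTED]`))
}
```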