Track Secret Detection findings by filename and value
Background
See #387583 (closed).
This issue is to discuss a particular potential solution to the problem.
Motivation
Today, Secret Detection (SD) findings are fingerprinted within the project by filename and line number. As a result, many new findings are created by changes that are not semantically meaningful, such as a secret shifting to a different line.
The current vulnerability data model makes it difficult for us to immediately adopt a fingerprinting definition that covers vulnerabilities that occur in more than one location. For this reason, we can't immediately jump to, for example, fingerprinting by rule ID and value, which would allow us to track a single leaked credential as the top-level object with multiple leak locations. But we shouldn't let the perfect be the enemy of the good, so this proposal introduces an incremental step forward that should solve the most common and highest-volume problems.
Proposal
Fingerprint findings by filename, rule ID, and secret value.
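As a minimal sketch of what such a fingerprint could look like (the function name, the choice of SHA-256, and the length-prefixed encoding are assumptions for illustration, not the analyzer's actual implementation):

```python
import hashlib


def fingerprint(filename: str, rule_id: str, secret_value: str) -> str:
    """Hypothetical fingerprint over filename, rule ID, and secret value.

    Each field is length-prefixed before hashing so that different field
    splits, such as ("ab", "c") and ("a", "bc"), cannot collide.
    """
    digest = hashlib.sha256()
    for field in (filename, rule_id, secret_value):
        data = field.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

The line number is deliberately not an input, so a secret that shifts within the file keeps the same fingerprint. Hashing also means the raw secret value does not appear in the fingerprint itself.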
Intended semantics
The overall goal is to reduce the user's impression of error (or, put another way, to minimize user surprise). This means that the system should create a new finding only when a reasonable observer would judge it to be a new leak.
Today (before changing anything)
A new finding is created, and the old one is "orphaned", when:
- A leak moves within a file
- A new leak of the same value appears within the same file
- A leak moves across files
And, of course, a new finding is created when a brand-new leak occurs.
After changing
A new finding is no longer created if:
- A leak moves within a file
- A new leak of the same value appears within the same file
Otherwise, the existing workflow (MR widget, pipeline report, vuln report) should treat the findings the same as before.
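The before and after semantics can be compared with a small simulation. This is a sketch under simplifying assumptions: the `Leak` type, the key functions, and the sample values are hypothetical, and the current behavior is modeled as filename plus line number only, as described in the Motivation section.

```python
from typing import Callable, Iterable, NamedTuple


class Leak(NamedTuple):
    filename: str
    line: int
    rule_id: str
    value: str


def current_key(leak: Leak) -> tuple:
    # Today's behavior as described above: filename and line number.
    return (leak.filename, leak.line)


def proposed_key(leak: Leak) -> tuple:
    # Proposed behavior: filename, rule ID, and secret value.
    return (leak.filename, leak.rule_id, leak.value)


def count_new(before: Iterable[Leak], after: Iterable[Leak],
              key: Callable[[Leak], tuple]) -> int:
    """Count fingerprints present in the new scan but not in the old one."""
    known = {key(leak) for leak in before}
    return len({key(leak) for leak in after} - known)


original = Leak("config.yml", 10, "example_token", "example-secret-1")
before = [original]

scenarios = {
    "moved within file": [original._replace(line=14)],
    "same value again in same file": [original, original._replace(line=30)],
    "moved across files": [original._replace(filename="other.yml")],
    "brand-new leak": [
        original,
        Leak("config.yml", 50, "example_token", "example-secret-2"),
    ],
}

for name, after in scenarios.items():
    print(name,
          count_new(before, after, current_key),
          count_new(before, after, proposed_key))
```

Under these assumptions, the first two scenarios produce one new finding with the current key and none with the proposed key, while the last two produce one new finding with either key.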
Limitations
Regex patterns for several cryptographic keys in the ruleset match only the BEGIN ... keywords but not the whole value. This leads to false negatives when the same file contains multiple occurrences of the same rule type with different values. Because each match captures only the keyword text rather than the key itself, we would end up with only one finding for the rule type instead of one finding per key.
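The collision can be reproduced with a short sketch. The pattern, rule ID, filename, and key bodies below are hypothetical stand-ins and are not taken from the real ruleset.

```python
import re

# Hypothetical rule that, like the rules described above, matches only the
# header keywords and not the key material that follows.
HEADER_ONLY = re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")

# Two different (fake) keys in the same file.
FILE_TEXT = """\
-----BEGIN RSA PRIVATE KEY-----
FAKEBODYONE
-----END RSA PRIVATE KEY-----
-----BEGIN RSA PRIVATE KEY-----
FAKEBODYTWO
-----END RSA PRIVATE KEY-----
"""


def proposed_keys(filename: str, rule_id: str, text: str) -> list:
    """Build (filename, rule ID, matched value) keys for every match."""
    return [(filename, rule_id, m.group(0)) for m in HEADER_ONLY.finditer(text)]


keys = proposed_keys("deploy/keys.pem", "private_key", FILE_TEXT)
distinct = set(keys)
print(len(keys), len(distinct))
```

Both keys are detected, but the matched value is the same header text in each case, so the two occurrences collapse into a single fingerprint.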
Questions to answer
- Are there side effects to anticipate if we change the fingerprinting logic?
- Do we have an upgrade path if we change the logic for secrets? (Compare to SAST, where we can have multiple fingerprinting algorithms and the comparison logic picks the best one, so it’s relatively easy to release updated logic.)
Next steps
- Confirm that intended semantics are met by this proposal.
- Identify any technical risks related to how fingerprints are used, for example in the MR widget, pipeline report, vuln report, or security policies.