This issue is to discuss a particular potential solution to the problem.
Motivation
Today, SD findings are fingerprinted within the project by filename and line number. This causes significant numbers of new findings to be created by changes that are not semantically meaningful.
The current vulnerability data model makes it difficult for us to immediately adopt a fingerprinting definition that covers vulns that occur in more than one location. For this reason, we can't immediately jump to, for example, fingerprinting by rule ID and value, which would allow us to track a single leaked credential as the top-level object with multiple leak locations. But, we shouldn't let the perfect be the enemy of the good, so this proposal introduces an incremental step forward that should solve the most common and high-volume problems.
Proposal
Fingerprint findings by filename, rule ID, and secret value.
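To make the proposal concrete, here is a minimal Go sketch of what such a fingerprint could look like. The hash function, separator, and rule ID shown are assumptions for illustration, not the actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fingerprint sketches the proposed scheme: filename + rule ID + secret value.
// Hashing the secret value first avoids embedding the raw secret in the fingerprint.
func fingerprint(filename, ruleID, secretValue string) string {
	valueHash := sha256.Sum256([]byte(secretValue))
	raw := fmt.Sprintf("%s|%s|%x", filename, ruleID, valueHash)
	return fmt.Sprintf("%x", sha256.Sum256([]byte(raw)))
}

func main() {
	// Two leaks of the same value in the same file collapse into one fingerprint,
	// regardless of line number.
	fmt.Println(fingerprint("app/test.rb", "gitlab_personal_access_token", "glpat-123"))
	fmt.Println(fingerprint("app/test.rb", "gitlab_personal_access_token", "glpat-123"))
}
```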
Intended semantics
The overall goal is to reduce user impression of error (or, put another way, to minimize user surprise). This means that the system should create a new finding when a reasonable observer would judge the new finding to be a new leak.
Today (before changing anything)
A new finding is created, and the old one is "orphaned", when:
A leak moves within a file
A new leak of the same value appears within the same file
A leak moves across files
And, of course, a new finding is created when a brand-new leak occurs.
After changing
A new finding is no longer created if:
A leak moves within a file
A new leak of the same value appears within the same file
Otherwise, the existing workflow (MR widget, pipeline report, vuln report) should treat the findings the same as before.
Limitations
The regex patterns for several cryptographic key rules in the ruleset match only the BEGIN ... keywords, not the whole key value. This leads to false negatives when a file contains multiple occurrences with different values for the same rule type: because the matched value is identical for each occurrence, we would end up with a single finding for that rule type instead of one per distinct key. For example, two different RSA private keys in the same file would both yield the same matched BEGIN header and therefore the same fingerprint.
Questions to answer
Are there side effects to anticipate if we change the fingerprinting logic?
Do we have an upgrade path if we change the logic for secrets? (Compare to SAST, where we can have multiple fingerprinting algorithms and the comparison logic picks the best one, so it’s ~easy to release updated logic)
Next steps
Confirm that intended semantics are met by this proposal.
Identify any technical risks related to how fingerprints are used, for example in the MR widget, pipeline report, vuln report, or security policies.
Do we have the secret value available to use in the fingerprint?
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
What's the experience going to be for someone who has secret detection with the old fingerprinting and they run SD scanning with the new fingerprinting?
Do we have the secret value available to use in the fingerprint?
Yes, it is available in the report generated by the analyzer in the raw_source_code_extract key.
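For illustration only, a small Go sketch of reading that key out of the report JSON. The exact nesting of raw_source_code_extract shown here is an assumption based on my reading of the secure report schema, not a definitive reference.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Only the fields relevant to fingerprinting are modelled here; the real
// report schema contains many more properties.
type report struct {
	Vulnerabilities []struct {
		ID                   string `json:"id"`
		RawSourceCodeExtract string `json:"raw_source_code_extract"`
		Location             struct {
			File string `json:"file"`
		} `json:"location"`
	} `json:"vulnerabilities"`
}

func main() {
	data := []byte(`{"vulnerabilities":[{"id":"abc","raw_source_code_extract":"glpat-123","location":{"file":"app/test.rb"}}]}`)

	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		panic(err)
	}
	for _, v := range r.Vulnerabilities {
		fmt.Println(v.Location.File, v.RawSourceCodeExtract)
	}
}
```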
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
I am not sure it is expected, but I believe the behavior as proposed would only have one finding per file. As proposed, we may not be able to tell that a given file has more than one instance of the finding - there would be no aggregation.
What's the experience going to be for someone who has secret detection with the old fingerprinting and they run SD scanning with the new fingerprinting?
If we did nothing to account for this, the old finding for a given secret would show as no longer present and a new finding for that secret would be created with the new fingerprint. Any triaging of the old finding would not be associated with the new one.
We may be able to adapt the Advanced Vulnerability Tracking that we do in SAST and use it for Secret Detection. If so, we could take the same approach as we do for SAST where we have migrated findings to a new tracking algorithm without losing context. I think this would be the ideal experience.
I'm just thinking out loud here. Since the line number cannot be a unique differentiator here, what if we choose the next closest thing to it: the character offset within the line where the rule pattern matches?
For example:
```
# File: app/test.rb

 1 | # This is a comment blah blah
 2 | # example using Gitlab key glpat-123
   | ---------------------------^ 28
...
 6 | secret = "glpat-123"
   | ----------^ 11
...
19 | def test()
20 |     gitlab_token = "glpat-123"
   | --------------------^ 21 (spaces included)
```
Here we have three occurrences of the same secret value (glpat-123) in the app/test.rb file:
At line 2, in the comment, with the match at the 28th character position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|28")
At line 6, in the variable declaration, at the 11th position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|11")
At line 20, in the variable declaration, at the 21st position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|21")
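A minimal Go sketch of how the fingerprints above could be computed; the hash function and separators are assumptions, and offsetFingerprint is a hypothetical helper rather than analyzer code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// offsetFingerprint extends the proposed scheme with the 1-based character
// offset of the match within its line, so duplicate values in one file stay distinct.
func offsetFingerprint(filename, ruleID, secretValue string, charOffset int) string {
	valueHash := fmt.Sprintf("%x", sha256.Sum256([]byte(secretValue)))
	raw := fmt.Sprintf("%s|%s|%s|%d", filename, ruleID, valueHash, charOffset)
	return fmt.Sprintf("%x", sha256.Sum256([]byte(raw)))
}

func main() {
	// The three occurrences from the example above produce three distinct fingerprints.
	for _, offset := range []int{28, 11, 21} {
		fmt.Println(offsetFingerprint("app/test.rb", "gitlab_token", "glpat-123", offset))
	}
}
```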
In the above case, the fingerprint remains unique in the following scenarios:
When new lines are added or removed in the file
When modifications are made in the same line where the leak is detected, after the leak value's position.
And, the fingerprint loses its uniqueness in the following scenarios:
When the file is renamed or moved
When any modification is made on the same line where the leak is detected, before the leak value's position. Example: changing the variable name in the declaration, or linting changes (rare)
The suggested approach is merely an increment to the proposed solution mentioned in the issue and isn't futureproof. It is done purely from a quick-win viewpoint since the engineering effort is minimal and the urgency for a resolution is pretty high for the customers.
As mentioned here, the ideal UX would be a unified view of a single leak value detected across multiple locations, instead of a separate finding for each location. The ideal solution to this would be changing things from the ground up as suggested here, or adopting a technically sophisticated one like this approach.
@connorgilbert @theoretick What are your thoughts? Is it worth pursuing this angle, or should we stick with the Filename+RuleID+RuleValue approach for now?
The suggested approach is merely an increment to the proposed solution mentioned in the issue and isn't futureproof. It is done purely from a quick-win viewpoint since the engineering effort is minimal and the urgency for a resolution is pretty high for the customers.
For that, I'm supportive. Any advanced tracking is better than none.
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
I am not sure it is expected, but I believe the behavior as proposed would only have one finding per file. As proposed, we may not be able to tell that a given file has more than one instance of the finding - there would be no aggregation.
I suppose the main benefit of your proposal is that we would still track two occurrences of the same secret within one file as two separate vulnerabilities. There's benefit there since our schema is currently too limited in not supporting multiple locations, but it also means we are likely to lose tracking more frequently in this scenario:
the fingerprint loses its uniqueness in the following scenarios... When any modification is made in the same line where the leak is detected, before the leak value's position. Example: Changing the variable name in the var declaration, linting (rare)
Filename+RuleID+RuleValue - pro: less noise in triaging vuln lists, con: deduplicates occurrences within a single file (once you resolve one, it might suddenly jump to a secondary detection)
Filename+RuleID+RuleValue+Offset - pro: tracks multiple occurrences within a single file, con: noisier vuln lists to triage
I lean towards the former due to implementation simplicity, and we can always "refine" the algorithm further if it doesn't seem like enough, similar to the SAST tracking work.
Having said that, I think it's ultimately a product question, so I'll defer to @connorgilbert
I might push back on "rare", since it's not only about the name of the assignment but any whitespace changes as well, which would likely be significant.
I implied the rarity of horizontal spacing changes since linters are often strict about horizontal spacing by default (demanded by some languages), while not for vertical spacing unless specifically configured. If I understand correctly, https://gitlab.com/gitlab-org/security-products/post-analyzers/tracking-calculator/-/merge_requests/53+ was primarily designed to tackle vertical spacing but not horizontal spacing. Isn't it? Anyway, I do agree that the proposed solution isn't significant enough to parse the source tree.
con: noisier vuln lists to triage
Okay, I thought this was a hard requirement instead, i.e., N secret occurrences should create N separate findings even if the same secret is duplicated across multiple locations. If not, then the Filename+RuleID+RuleValue approach is much simpler
was primarily designed to tackle vertical spacing but not horizontal spacing. Isn't it? Anyway, I do agree that the proposed solution isn't significant enough to parse the source tree.
Yep, it was, but we never considered the other way, primarily because we end up parsing the AST regardless, as you raised.
Okay, I thought this was a hard requirement instead, i.e., N secret occurrences should create N separate findings even if the same secret is duplicated across multiple locations. If not, then the Filename+RuleID+RuleValue approach is much simpler
My interpretation of the following from the issue body was deduping, but perhaps I was mistaken
After changing
A new finding is no longer created if:
A new leak of the same value appears within the same file
My interpretation of the following from the issue body was deduping, but perhaps I was mistaken
You're right. My mistake, I got confused. I might've picked that requirement from somewhere during the exploration phase. So, we create only one finding for a secret value regardless of its multiple occurrences within the same file.
With that approach in mind, I was wondering about the implementation approach. What I roughly have in mind is the following steps:
Introduce a new tracking algorithm type, say rule_value, in Rails, representing that the tracking signature is generated from rule parameters, i.e., the rule ID and the matched value.
From the analyzer side, we introduce a new value for the tracking property in the final Secret Detection Report, similar to how we add scope_offset tracking information for SAST.
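As an illustration (not the final report format), a rough Go sketch of what such a tracking entry could look like; the field names mirror the SAST tracking signatures as I understand them, and the rule_value signature content is a placeholder assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal model of the report's tracking section; only the fields needed to
// illustrate the proposed rule_value signature are included.
type signature struct {
	Algorithm string `json:"algorithm"`
	Value     string `json:"value"`
}

type trackingItem struct {
	File       string      `json:"file"`
	LineStart  int         `json:"line_start"`
	LineEnd    int         `json:"line_end"`
	Signatures []signature `json:"signatures"`
}

type tracking struct {
	Type  string         `json:"type"`
	Items []trackingItem `json:"items"`
}

func main() {
	t := tracking{
		Type: "source",
		Items: []trackingItem{{
			File:      "app/test.rb",
			LineStart: 6,
			LineEnd:   6,
			Signatures: []signature{{
				Algorithm: "rule_value",
				// Hypothetical signature content: rule ID plus a hash of the matched value.
				Value: "gitlab_personal_access_token|<value_hash>",
			}},
		}},
	}

	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(t); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```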
If I understand it right, with these two steps we should be able to create findings for the new detections whereas the existing findings remain intact until their identified secret moves within the file (due to UUID override logic). Please correct me if I'm wrong here.
Am I thinking in the right direction? Do you have any feedback?
/cc @rossfuhrman tagging you for feedback too in case you've worked around these concepts before
As a sidenote, one limitation of ALGORITHM_TYPES is that the collection is an enumeration of prioritized signatures. Since it's one-dimensional, this could be problematic later if we need to balance scope_offset_compressed with rule_value, but for now I think it makes sense to make rule_value the highest priority (the last entry in the collection).
So, we create only one finding for a secret value regardless of its multiple occurrences within the same file.
Just thinking out loud: another reason this approach makes more sense is the remediation flow. A SAST remediation involves fixing the vulnerable line of code; a secret remediation involves the external process of revocation. As such, it doesn't really matter that there are N occurrences; the remediation (revoking the secret) is the same.
@vbhat161 sorry for the delay, I hope the specific question was answered along the way to the later progress. But yes, tracking only matters in GitLab Ultimate because Vuln Management is limited to Ultimate. cc @smeadzinger
Technical Risk Evaluation: Introducing tracking-based fingerprinting for SD primarily impacts the finding de-duplication logic, which happens in two places: when storing findings in the DB (Vulnerability Report), and when comparing pipeline report findings between the base and latest pipelines (MR widget).
Implementation Approach:
The solution is a two-part implementation:
Secret Analyzer: Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report.
Rails monolith: Utilize the tracking information from the SD report to ensure new findings are not created if there are findings for the same secret present in the system.
I have created draft MRs for both the Secret Analyzer and Rails implementations and assigned @rossfuhrman / @theoretick for an initial round of reviews. I'll update this issue with the expected outcome of the MRs soon.
There has been a slight modification to the implementation approach outlined above. As described here, the idea of embedding tracking information within the SD analyzer seemed inconsistent compared with the way we did it for the SAST analyzers. To keep it consistent, we'll move the logic of "Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report." to the Tracking Calculator project and raise an MR from there. This adds a couple of extra steps to the implementation, as outlined below:
Tracking Calculator: Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report.
SD Analyzer: Add Tracking Calculator dependency in the Dockerfile[-fips]
Rails monolith: Utilize the tracking information from the SD report to ensure new findings are not created if there are findings for the same secret present in the system.
Added a Limitations section to the issue description to keep the implementation's drawbacks transparent.