This issue is to discuss a particular potential solution to the problem.
Motivation
Today, SD findings are fingerprinted within the project by filename and line number. This causes significant numbers of new findings to be created by changes that are not semantically meaningful.
The current vulnerability data model makes it difficult for us to immediately adopt a fingerprinting definition that covers vulns that occur in more than one location. For this reason, we can't immediately jump to, for example, fingerprinting by rule ID and value, which would allow us to track a single leaked credential as the top-level object with multiple leak locations. But, we shouldn't let the perfect be the enemy of the good, so this proposal introduces an incremental step forward that should solve the most common and high-volume problems.
Proposal
Fingerprint findings by filename, rule ID, and secret value.
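To make the proposal concrete, here is a minimal Go sketch of what such a fingerprint could look like. The hash function, separator, and rule ID shown are assumptions for illustration, not the actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fingerprint sketches the proposed scheme: filename + rule ID + secret value.
// Hashing the secret value first avoids embedding the raw secret in the fingerprint.
func fingerprint(filename, ruleID, secretValue string) string {
	valueHash := sha256.Sum256([]byte(secretValue))
	raw := fmt.Sprintf("%s|%s|%x", filename, ruleID, valueHash)
	return fmt.Sprintf("%x", sha256.Sum256([]byte(raw)))
}

func main() {
	// Two leaks of the same value in the same file collapse into one fingerprint,
	// regardless of line number.
	fmt.Println(fingerprint("app/test.rb", "gitlab_personal_access_token", "glpat-123"))
	fmt.Println(fingerprint("app/test.rb", "gitlab_personal_access_token", "glpat-123"))
}
```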
Intended semantics
The overall goal is to reduce user impression of error (or, put another way, to minimize user surprise). This means that the system should create a new finding when a reasonable observer would judge the new finding to be a new leak.
Today (before changing anything)
A new finding is created, and the old one is "orphaned", when:
A leak moves within a file
A new leak of the same value appears within the same file
A leak moves across files
And, of course, a new finding is created when a brand-new leak occurs.
After changing
A new finding is no longer created if:
A leak moves within a file
A new leak of the same value appears within the same file
Otherwise, the existing workflow (MR widget, pipeline report, vuln report) should treat the findings the same as before.
Limitations
The regex patterns for several cryptographic key rules in the ruleset match only the BEGIN ... keywords, not the whole key value. This leads to false negatives when a file contains multiple occurrences with different values for the same rule type: because the matched value is identical for each occurrence, we would end up with a single finding for that rule type instead of one per distinct key. For example, two different RSA private keys in the same file would both yield the same matched BEGIN header and therefore the same fingerprint.
Questions to answer
Are there side effects to anticipate if we change the fingerprinting logic?
Do we have an upgrade path if we change the logic for secrets? (Compare to SAST, where we can have multiple fingerprinting algorithms and the comparison logic picks the best one, so it’s ~easy to release updated logic)
Next steps
Confirm that intended semantics are met by this proposal.
Identify any technical risks related to how fingerprints are used, for example in the MR widget, pipeline report, vuln report, or security policies.
Do we have the secret value available to use in the fingerprint?
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
What's the experience going to be for someone who has secret detection with the old fingerprinting and they run SD scanning with the new fingerprinting?
Do we have the secret value available to use in the fingerprint?
Yes, it is available in the report generated by the analyzer in the raw_source_code_extract key.
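For illustration only, a small Go sketch of reading that key out of the report JSON. The exact nesting of raw_source_code_extract shown here is an assumption based on my reading of the secure report schema, not a definitive reference.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Only the fields relevant to fingerprinting are modelled here; the real
// report schema contains many more properties.
type report struct {
	Vulnerabilities []struct {
		ID                   string `json:"id"`
		RawSourceCodeExtract string `json:"raw_source_code_extract"`
		Location             struct {
			File string `json:"file"`
		} `json:"location"`
	} `json:"vulnerabilities"`
}

func main() {
	data := []byte(`{"vulnerabilities":[{"id":"abc","raw_source_code_extract":"glpat-123","location":{"file":"app/test.rb"}}]}`)

	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		panic(err)
	}
	for _, v := range r.Vulnerabilities {
		fmt.Println(v.Location.File, v.RawSourceCodeExtract)
	}
}
```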
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
I am not sure it is expected, but I believe the behavior as proposed would only have one finding per file. As proposed, we may not be able to tell that a given file has more than one instance of the finding - there would be no aggregation.
What's the experience going to be for someone who has secret detection with the old fingerprinting and they run SD scanning with the new fingerprinting?
If we did nothing to account for this, the old finding for a given secret would show as no longer present and a new finding for that secret would be created with the new fingerprint. Any triaging of the old finding would not be associated with the new one.
We may be able to adapt the Advanced Vulnerability Tracking that we do in SAST and use it for Secret Detection. If so, we could take the same approach as we do for SAST where we have migrated findings to a new tracking algorithm without losing context. I think this would be the ideal experience.
I'm just thinking out loud here. Since the line number cannot be a unique differentiator here, what if we choose the next closest thing to it: the character offset within the line where the rule pattern matches?
For example:
```
# File: app/test.rb

 1 | # This is a comment blah blah
 2 | # example using Gitlab key glpat-123
   | ---------------------------^ 28
...
 6 | secret = "glpat-123"
   | ----------^ 11
...
19 | def test()
20 |     gitlab_token = "glpat-123"
   | --------------------^ 21 (spaces included)
```
Here we have three occurrences of the same secret value (glpat-123) in the app/test.rb file:
At line 2, in the comment, with the match at the 28th character position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|28")
At line 6, in the variable declaration, at the 11th position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|11")
At line 20, in the variable declaration, at the 21st position. Fingerprint -> hash("app/test.rb|gitlab_token|value_hash|21")
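A minimal Go sketch of how the fingerprints above could be computed; the hash function and separators are assumptions, and offsetFingerprint is a hypothetical helper rather than analyzer code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// offsetFingerprint extends the proposed scheme with the 1-based character
// offset of the match within its line, so duplicate values in one file stay distinct.
func offsetFingerprint(filename, ruleID, secretValue string, charOffset int) string {
	valueHash := fmt.Sprintf("%x", sha256.Sum256([]byte(secretValue)))
	raw := fmt.Sprintf("%s|%s|%s|%d", filename, ruleID, valueHash, charOffset)
	return fmt.Sprintf("%x", sha256.Sum256([]byte(raw)))
}

func main() {
	// The three occurrences from the example above produce three distinct fingerprints.
	for _, offset := range []int{28, 11, 21} {
		fmt.Println(offsetFingerprint("app/test.rb", "gitlab_token", "glpat-123", offset))
	}
}
```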
In the above case, the fingerprint remains unique in the following scenarios:
When new lines are added or removed in the file
When modifications are made in the same line where the leak is detected, after the leak value's position.
And, the fingerprint loses its uniqueness in the following scenarios:
When the file is renamed or moved
When any modification is made on the same line where the leak is detected, before the leak value's position. Example: changing the variable name in the declaration, or linting changes (rare)
The suggested approach is merely an increment to the proposed solution mentioned in the issue and isn't futureproof. It is done purely from a quick-win viewpoint since the engineering effort is minimal and the urgency for a resolution is pretty high for the customers.
As mentioned here, the ideal UX would be a unified view of a single leak value detected across multiple locations, instead of a separate finding for each location. The ideal solution to this would be changing things from the ground up as suggested here, or adopting a technically sophisticated one like this approach.
@connorgilbert @theoretick What are your thoughts? Is it worth pursuing this angle, or should we stick with the Filename+RuleID+RuleValue approach for now?
The suggested approach is merely an increment to the proposed solution mentioned in the issue and isn't futureproof. It is done purely from a quick-win viewpoint since the engineering effort is minimal and the urgency for a resolution is pretty high for the customers.
For that, I'm supportive. Any advanced tracking is better than none.
What is the expected behavior if there are duplicates of the same secret in one file, but on different line numbers?
I am not sure it is expected, but I believe the behavior as proposed would only have one finding per file. As proposed, we may not be able to tell that a given file has more than one instance of the finding - there would be no aggregation.
I suppose the main benefit of your proposal is that we would still track two occurrences of the same secret within one file as two separate vulnerabilities. There's benefit there since our schema is currently too limited in not supporting multiple locations, but it also means we are likely to lose tracking more frequently in this scenario:
the fingerprint loses its uniqueness in the following scenarios... When any modification is made in the same line where the leak is detected, before the leak value's position. Example: Changing the variable name in the var declaration, linting (rare)
Filename+RuleID+RuleValue - pro: less noise in triaging vuln lists, con: deduplicates occurrences within a single file (once you resolve one, it might suddenly jump to a secondary detection)
Filename+RuleID+RuleValue+Offset - pro: tracks multiple occurrences within a single file, con: noisier vuln lists to triage
I lean towards the former due to implementation simplicity, and we can always "refine" the algorithm further if it doesn't seem like enough, similar to the SAST tracking work.
Having said that, I think it's ultimately a product question, so I'll defer to @connorgilbert
I might push back on "rare", since it's not only about the name of the assignment but any whitespace changes as well, which would likely be significant.
I implied the rarity of horizontal spacing changes since linters are often strict about horizontal spacing by default (demanded by some languages), while not for vertical spacing unless specifically configured. If I understand correctly, https://gitlab.com/gitlab-org/security-products/post-analyzers/tracking-calculator/-/merge_requests/53+ was primarily designed to tackle vertical spacing but not horizontal spacing. Isn't it? Anyway, I do agree that the proposed solution isn't significant enough to parse the source tree.
con: noisier vuln lists to triage
Okay, I thought this was a hard requirement instead, i.e., N secret occurrences should create N separate findings even if the same secret is duplicated across multiple locations. If not, then the Filename+RuleID+RuleValue approach is much simpler
was primarily designed to tackle vertical spacing but not horizontal spacing. Isn't it? Anyway, I do agree that the proposed solution isn't significant enough to parse the source tree.
Yep, it was, but we never considered the other way, primarily because we end up parsing the AST regardless, as you raised.
Okay, I thought this was a hard requirement instead, i.e., N secret occurrences should create N separate findings even if the same secret is duplicated across multiple locations. If not, then the Filename+RuleID+RuleValue approach is much simpler
My interpretation of the following from the issue body was deduping, but perhaps I was mistaken
After changing
A new finding is no longer created if:
A new leak of the same value appears within the same file
My interpretation of the following from the issue body was deduping, but perhaps I was mistaken
You're right. My mistake, I got confused. I might've picked that requirement from somewhere during the exploration phase. So, we create only one finding for a secret value regardless of its multiple occurrences within the same file.
With that approach in mind, I was wondering about the implementation approach. What I roughly have in mind is the following steps:
Introduce a new tracking algorithm type, say rule_value, in Rails, representing that the tracking signature is generated from rule parameters, i.e., the rule ID and the matched value.
From the analyzer side, we introduce a new value for the tracking property in the final Secret Detection Report, similar to how we add scope_offset tracking information for SAST.
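As an illustration (not the final report format), a rough Go sketch of what such a tracking entry could look like; the field names mirror the SAST tracking signatures as I understand them, and the rule_value signature content is a placeholder assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Minimal model of the report's tracking section; only the fields needed to
// illustrate the proposed rule_value signature are included.
type signature struct {
	Algorithm string `json:"algorithm"`
	Value     string `json:"value"`
}

type trackingItem struct {
	File       string      `json:"file"`
	LineStart  int         `json:"line_start"`
	LineEnd    int         `json:"line_end"`
	Signatures []signature `json:"signatures"`
}

type tracking struct {
	Type  string         `json:"type"`
	Items []trackingItem `json:"items"`
}

func main() {
	t := tracking{
		Type: "source",
		Items: []trackingItem{{
			File:      "app/test.rb",
			LineStart: 6,
			LineEnd:   6,
			Signatures: []signature{{
				Algorithm: "rule_value",
				// Hypothetical signature content: rule ID plus a hash of the matched value.
				Value: "gitlab_personal_access_token|<value_hash>",
			}},
		}},
	}

	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(t); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```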
If I understand it right, with these two steps we should be able to create findings for the new detections whereas the existing findings remain intact until their identified secret moves within the file (due to UUID override logic). Please correct me if I'm wrong here.
Am I thinking in the right direction? Do you have any feedback?
/cc @rossfuhrman tagging you for feedback too in case you've worked around these concepts before
As a sidenote, one limitation of ALGORITHM_TYPES is that the collection is an enumeration of prioritized signatures. Since it's one-dimensional, this could be problematic later if we need to balance scope_offset_compressed with rule_value, but for now I think it makes sense to make rule_value the highest priority (the last entry in the collection).
So, we create only one finding for a secret value regardless of its multiple occurrences within the same file.
Just thinking out loud: another reason this approach makes more sense is the remediation flow. A SAST remediation involves fixing the vulnerable line of code; a secret remediation involves the external process of revocation. As such, it doesn't really matter that there are N occurrences; the remediation (revoking the secret) is the same.
@vbhat161 sorry for the delay, I hope the specific question was answered along the way to the later progress. But yes, tracking only matters in GitLab Ultimate because Vuln Management is limited to Ultimate. cc @smeadzinger
Technical Risk Evaluation: Introducing tracking-based fingerprinting for SD primarily impacts the finding de-duplication logic, which happens in two places: when storing findings in the DB (Vulnerability Report), and when comparing pipeline report findings between the base and latest pipelines (MR widget).
Implementation Approach:
The solution is a two-part implementation:
Secret Analyzer: Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report.
Rails monolith: Utilize the tracking information from the SD report to ensure new findings are not created if there are findings for the same secret present in the system.
I have created draft MRs for both the Secret Analyzer and Rails implementations and assigned @rossfuhrman / @theoretick for an initial round of reviews. I'll update this issue with the expected outcome of the MRs soon.
There has been a slight modification to the implementation approach outlined above. As described here, the idea of embedding tracking information within the SD analyzer seemed inconsistent compared with the way we did it for the SAST analyzers. To keep it consistent, we'll move the logic of "Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report." to the Tracking Calculator project and raise an MR from there. This adds a couple of extra steps to the implementation, as outlined below:
Tracking Calculator: Embed tracking information using the rule_value algorithm for each vulnerability in the Secret Detection Report.
SD Analyzer: Add Tracking Calculator dependency in the Dockerfile[-fips]
Rails monolith: Utilize the tracking information from the SD report to ensure new findings are not created if there are findings for the same secret present in the system.
Added a Limitations section to the issue description to keep the implementation's drawbacks transparent.