Prefer vulnerability correlation over deduplication across report types

Prefer vulnerability correlation over deduplication across report types

Formerly: Improve vulnerability deduplication across report types

Following up on discussion in !225747 (closed), the current deduplication logic doesn't work across report types. This limits our ability to correlate findings from different scanners (e.g., SARIF reports alongside SAST/DAST).

Instead of introducing deduplication across report types we should introduce a canonical report-type agnostic fingerprint. This allows us to correlate similar findings with overly-aggressive deduplication, preserving multiple finding's fidelity while allowing the results to be related and (eventually) co-managed.

Proposal

Add a canonical_fingerprint UUID to vulnerability_occurrences containing (project_id, identifier_fingerprint, location_fingerprint, context_id) (without report_type).

Once potentially-related findings share a canonical fingerprint we can:

  1. Preserve all findings1 from different report types (and possibly scanners)
  2. Introduce a "also detected by" relationship to the vulnerability details by querying vulnerability_occurrences WHERE canonical_fingerprint = $1 AND uuid != $2
  3. Keep codebase policy-aware. A approval policy could check if there is already a dismissed finding with the same canonical_fingerprint
    1. When performing "newly detected" findings check, determine whether any finding with the same canonical_fingerprint already exist with "dismissed" state
  4. Consider cascading dismissals. If a user explicitly opts into "propagate dismissal to related findings," the service has a clean query to find them
  5. (Possible) aggregate deduplication. Group by canonical_fingerprint and show the highest-priority finding as "primary".

If we do this we don't need to modify the existing UUID, it still corresponds to ultimate identity and it won't change any report_type semantics. We also don't need to worry about which scanner is prioritized, although we could still do so if we wanted to prefer native GitLab ones.

Implementation

  1. Add vulnerability_occurrences.canonical_fingerprint and vulnerability_reads.canonical_fingerprint columns
  2. Update ingestion and POROs to compute canonical_fingerprint (Security::Finding and FindingMap)
  3. Expose as new GraphQL canonicalfingerprint field

Additional proposed "enhancements" around correlation policy awareness and correlation UI can probably be deferred until more appropriately specced out.


Original Discussion

The following discussion from !225747 (closed) should be addressed:

  • @minac started a discussion: (+1 comment)

    issue (non-blocking): We should keep in mind that the deduplication logic doesn't work across report types.

  • @theoretick replied

    TBH improving our general deduplication approach is a big item on my TODO list, especially in relation to third party scanners for both ASPM. I don't know how much we can improve it but it's top of mind as we need to decouple our curated handling of analyzer-specific vulns for a more general correlation/grouping direction.


  1. This is under discussion, see #592410 (comment 3158585315) thread

Edited by Lucas Charles