Prefer vulnerability correlation over deduplication across report types
Prefer vulnerability correlation over deduplication across report types
Formerly: Improve vulnerability deduplication across report types
Following up on discussion in !225747 (closed), the current deduplication logic doesn't work across report types. This limits our ability to correlate findings from different scanners (e.g., SARIF reports alongside SAST/DAST).
Instead of introducing deduplication across report types we should introduce a canonical report-type agnostic fingerprint. This allows us to correlate similar findings with overly-aggressive deduplication, preserving multiple finding's fidelity while allowing the results to be related and (eventually) co-managed.
Proposal
Add a canonical_fingerprint UUID to vulnerability_occurrences containing (project_id, identifier_fingerprint, location_fingerprint, context_id) (without report_type).
Once potentially-related findings share a canonical fingerprint we can:
- Preserve all findings1 from different report types (and possibly scanners)
- Introduce a "also detected by" relationship to the vulnerability details by querying
vulnerability_occurrences WHERE canonical_fingerprint = $1 AND uuid != $2 - Keep codebase policy-aware. A approval policy could check if there is already a dismissed finding with the same
canonical_fingerprint- When performing "newly detected" findings check, determine whether any finding with the same canonical_fingerprint already exist with "dismissed" state
- Consider cascading dismissals. If a user explicitly opts into "propagate dismissal to related findings," the service has a clean query to find them
- (Possible) aggregate deduplication. Group by
canonical_fingerprintand show the highest-priority finding as "primary".
If we do this we don't need to modify the existing UUID, it still corresponds to ultimate identity and it won't change any report_type semantics. We also don't need to worry about which scanner is prioritized, although we could still do so if we wanted to prefer native GitLab ones.
Implementation
- Add
vulnerability_occurrences.canonical_fingerprintandvulnerability_reads.canonical_fingerprintcolumns - Update ingestion and POROs to compute
canonical_fingerprint(Security::FindingandFindingMap) - Expose as new GraphQL
canonicalfingerprintfield
Additional proposed "enhancements" around correlation policy awareness and correlation UI can probably be deferred until more appropriately specced out.
Original Discussion
The following discussion from !225747 (closed) should be addressed:
-
@minac started a discussion: (+1 comment)
issue (non-blocking): We should keep in mind that the deduplication logic doesn't work across report types.
-
@theoretick replied
TBH improving our general deduplication approach is a big item on my TODO list, especially in relation to third party scanners for both ASPM. I don't know how much we can improve it but it's top of mind as we need to decouple our curated handling of analyzer-specific vulns for a more general correlation/grouping direction.
-
This is under discussion, see #592410 (comment 3158585315) thread ↩