Prefer vulnerability correlation over deduplication across report types (#592410) · Issues · GitLab.org / GitLab


# Prefer vulnerability correlation over deduplication across report types

Formerly: _Improve vulnerability deduplication across report types_

Following up on discussion in [!225747](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225747), the current deduplication logic doesn't work across report types. This limits our ability to correlate findings from different scanners (e.g., SARIF reports alongside SAST/DAST).

Instead of introducing deduplication across report types, we should introduce a **canonical, report-type-agnostic fingerprint**. This allows us to correlate similar findings without overly aggressive deduplication, preserving each finding's fidelity while allowing the results to be related and (eventually) co-managed.

## Proposal

Add a `canonical_fingerprint UUID` column to `vulnerability_occurrences`, computed from `(project_id, identifier_fingerprint, location_fingerprint, context_id)` (without `report_type`).

Once potentially related findings share a canonical fingerprint, we can:

1. Preserve all findings[^1] from different report types (and possibly scanners)
2. Introduce an "also detected by" relationship in the vulnerability details by querying `vulnerability_occurrences WHERE canonical_fingerprint = $1 AND uuid != $2`
3. Keep the codebase [policy](https://docs.gitlab.com/user/application_security/policies/merge_request_approval_policies/)-aware. An approval policy could check whether there is already a dismissed finding with the same `canonical_fingerprint`
   1. When performing the "newly detected" findings check, determine whether any finding with the same `canonical_fingerprint` already exists in the "dismissed" state
4. Consider cascading dismissals. If a user explicitly opts into "propagate dismissal to related findings," the service has a clean query to find them
5. (Possible) aggregate deduplication. Group by `canonical_fingerprint` and show the highest-priority finding as "primary".
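To make the proposal concrete, here is a minimal Ruby sketch of how a report-type-agnostic fingerprint could be derived and used for the "also detected by" lookup. It is an illustration under assumptions, not the GitLab implementation: the `Finding` struct, the `CANONICAL_NAMESPACE` value, the `-` separator, the choice of a name-based (version 5) UUID, and the helper names are all hypothetical. Only the four input fields and the exclusion of `report_type` come from the proposal.

```ruby
require "digest"

# Hypothetical namespace for this sketch only (the RFC 4122 DNS namespace,
# borrowed as a placeholder). A real implementation would pick its own.
CANONICAL_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Stand-in for a finding record; field names mirror the proposal.
Finding = Struct.new(
  :uuid, :project_id, :report_type,
  :identifier_fingerprint, :location_fingerprint, :context_id,
  keyword_init: true
)

# Name-based UUID (version 5): SHA-1 over namespace bytes + name,
# truncated to 16 bytes with the version and variant bits set.
def uuid_v5(namespace, name)
  ns_bytes = [namespace.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(ns_bytes + name.b).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # RFC 4122 variant
  hex = bytes.pack("C*").unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Deliberately omits report_type so findings from different report
# types can share a fingerprint.
def canonical_fingerprint(finding)
  name = [
    finding.project_id,
    finding.identifier_fingerprint,
    finding.location_fingerprint,
    finding.context_id
  ].join("-")
  uuid_v5(CANONICAL_NAMESPACE, name)
end

# In-memory analogue of:
#   vulnerability_occurrences WHERE canonical_fingerprint = $1 AND uuid != $2
def also_detected_by(findings, finding)
  target = canonical_fingerprint(finding)
  findings.select do |other|
    other.uuid != finding.uuid && canonical_fingerprint(other) == target
  end
end

sast  = Finding.new(uuid: "a", project_id: 1, report_type: "sast",
                    identifier_fingerprint: "id1", location_fingerprint: "loc1", context_id: 7)
sarif = Finding.new(uuid: "b", project_id: 1, report_type: "sarif",
                    identifier_fingerprint: "id1", location_fingerprint: "loc1", context_id: 7)

related = also_detected_by([sast, sarif], sast)
puts related.map(&:report_type).inspect
```

Both findings keep their own `uuid`, so nothing is merged or dropped; the shared fingerprint only relates them.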
If we do this, we don't need to modify the existing UUID: it still corresponds to ultimate identity, and it won't change any `report_type` semantics. We also don't need to worry about which scanner is prioritized, although we could still do so if we wanted to prefer native GitLab ones.

## Implementation

1. Add `vulnerability_occurrences.canonical_fingerprint` and `vulnerability_reads.canonical_fingerprint` columns
2. Update ingestion and POROs to compute `canonical_fingerprint` (`Security::Finding` and `FindingMap`)
3. Expose as a new GraphQL `canonicalfingerprint` field

Additional proposed "enhancements" around correlation policy awareness and correlation UI can probably be deferred until they are more fully specced out.

---

## Original Discussion

The following discussion from !225747 should be addressed:

- [ ] @minac started a [discussion](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225747#note_3130999207): (+1 comment)

  > **issue (non-blocking):** We should keep in mind that the deduplication logic doesn't work across report types.

- [ ] @theoretick replied

  > TBH improving our general deduplication approach is a big item on my TODO list, especially [in relation to third party scanners for both ASPM](https://gitlab.com/groups/gitlab-org/-/work_items/20900#note_3120636957). I don't know how much we can improve it but it's top of mind as we need to decouple our curated handling of analyzer-specific vulns for a more general correlation/grouping direction.

---

[^1]: This is under discussion, see https://gitlab.com/gitlab-org/gitlab/-/work_items/592410#note_3158585315 thread
