# Prefer vulnerability correlation over deduplication across report types
Formerly: _Improve vulnerability deduplication across report types_
Following up on discussion in [!225747](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225747), the current deduplication logic doesn't work across report types. This limits our ability to correlate findings from different scanners (e.g., SARIF reports alongside SAST/DAST).
Instead of introducing deduplication across report types, we should introduce a **canonical, report-type-agnostic fingerprint**. This allows us to correlate similar findings without overly aggressive deduplication, preserving each finding's fidelity while allowing the results to be related and (eventually) co-managed.
## Proposal
Add a `canonical_fingerprint` UUID column to `vulnerability_occurrences`, computed from `(project_id, identifier_fingerprint, location_fingerprint, context_id)` (without `report_type`).
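As a rough illustration of how such a fingerprint could be derived, here is a minimal sketch that hashes the four components into a name-based (version 5) UUID. The namespace constant, the `-` separator, and the method names are assumptions made for this example only, not the existing GitLab implementation; the point is that `report_type` is not part of the input.

```ruby
require 'digest'

# Placeholder namespace for this sketch (the RFC 4122 DNS namespace).
# A real implementation would pick its own namespace UUID.
CANONICAL_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'

# Name-based UUID (version 5): SHA-1 over namespace bytes + name,
# truncated to 16 bytes with the version and variant bits set.
def uuid_v5(namespace, name)
  ns_bytes = [namespace.delete('-')].pack('H*')
  bytes = Digest::SHA1.digest(ns_bytes + name.b).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # RFC 4122 variant
  hex = bytes.pack('C*').unpack1('H*')
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
end

# Hypothetical helper: report_type is deliberately NOT an input, so the
# same issue reported by different report types maps to the same value.
def canonical_fingerprint(project_id:, identifier_fingerprint:, location_fingerprint:, context_id: nil)
  name = [project_id, identifier_fingerprint, location_fingerprint, context_id].join('-')
  uuid_v5(CANONICAL_NAMESPACE, name)
end
```

Two findings that differ only in `report_type` would receive the same `canonical_fingerprint` under this scheme, while their existing UUIDs stay distinct.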
Once potentially related findings share a canonical fingerprint, we can:
1. Preserve all findings[^1] from different report types (and possibly scanners)
1. Introduce an "also detected by" relationship in the vulnerability details by querying `vulnerability_occurrences WHERE canonical_fingerprint = $1 AND uuid != $2`
1. Keep the codebase [policy](https://docs.gitlab.com/user/application_security/policies/merge_request_approval_policies/)-aware. An approval policy could check whether a dismissed finding with the same `canonical_fingerprint` already exists
   1. When performing the "newly detected" findings check, determine whether any finding with the same `canonical_fingerprint` already exists in the "dismissed" state
1. Consider cascading dismissals. If a user explicitly opts into "propagate dismissal to related findings," the service has a clean query to find them
1. (Possible) aggregate deduplication. Group by `canonical_fingerprint` and show the highest-priority finding as "primary".
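The lookups in the list above can be sketched in memory as follows. This is only an illustration: the `Finding` struct, the `SEVERITY_RANK` table, the sample data, and the use of severity as the "priority" for choosing a primary finding are all assumptions for this example, and a real implementation would query `vulnerability_occurrences` instead.

```ruby
# Hypothetical in-memory stand-in for rows of vulnerability_occurrences.
Finding = Struct.new(:uuid, :report_type, :canonical_fingerprint, :state, :severity, keyword_init: true)

# Assumed ranking used to pick a "primary" finding per group.
SEVERITY_RANK = { 'critical' => 4, 'high' => 3, 'medium' => 2, 'low' => 1 }.freeze

# "Also detected by": same canonical fingerprint, different UUID.
def also_detected_by(findings, finding)
  findings.select do |other|
    other.canonical_fingerprint == finding.canonical_fingerprint && other.uuid != finding.uuid
  end
end

# Policy-style check: was a related finding already dismissed?
def previously_dismissed?(findings, finding)
  also_detected_by(findings, finding).any? { |other| other.state == 'dismissed' }
end

# Aggregate view: one "primary" finding per canonical fingerprint.
def primary_findings(findings)
  findings.group_by(&:canonical_fingerprint).transform_values do |group|
    group.max_by { |f| SEVERITY_RANK.fetch(f.severity, 0) }
  end
end

# Sample data (made up): two report types flag the same issue.
findings = [
  Finding.new(uuid: 'a', report_type: 'sast', canonical_fingerprint: 'fp-1', state: 'dismissed', severity: 'medium'),
  Finding.new(uuid: 'b', report_type: 'sarif', canonical_fingerprint: 'fp-1', state: 'detected', severity: 'high'),
  Finding.new(uuid: 'c', report_type: 'dast', canonical_fingerprint: 'fp-2', state: 'detected', severity: 'low')
]

related = also_detected_by(findings, findings[1])
dismissed_before = previously_dismissed?(findings, findings[1])
primaries = primary_findings(findings)
```

Here finding `b` is related to the dismissed finding `a`, so a "newly detected" check could treat it as already triaged, while both rows remain preserved.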
If we do this, we don't need to modify the existing UUID: it still serves as the finding's ultimate identity, and `report_type` semantics stay unchanged. We also don't need to decide which scanner is prioritized, although we could still do so if we wanted to prefer native GitLab ones.
## Implementation
1. Add `vulnerability_occurrences.canonical_fingerprint` and `vulnerability_reads.canonical_fingerprint` columns
1. Update ingestion and POROs (`Security::Finding` and `FindingMap`) to compute `canonical_fingerprint`
1. Expose it as a new GraphQL `canonicalFingerprint` field
Additional proposed "enhancements" around correlation policy awareness and correlation UI can probably be deferred until they are specced out more fully.
---
## Original Discussion
The following discussion from !225747 should be addressed:
- [ ] @minac started a [discussion](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/225747#note_3130999207): (+1 comment)
> **issue (non-blocking):** We should keep in mind that the deduplication logic doesn't work across report types.
- [ ] @theoretick replied
> TBH improving our general deduplication approach is a big item on my TODO list, especially [in relation to third party scanners for both ASPM](https://gitlab.com/groups/gitlab-org/-/work_items/20900#note_3120636957). I don't know how much we can improve it but it's top of mind as we need to decouple our curated handling of analyzer-specific vulns for a more general correlation/grouping direction.
---
[^1]: This is under discussion; see the thread at https://gitlab.com/gitlab-org/gitlab/-/work_items/592410#note_3158585315