Improve dedupe algorithm
Description:
We currently have a basic dedupe algorithm that avoids showing alerts from multiple tools about the same CVE. This was done in a simple way and doesn't cover all possible use cases.
If a project is impacted by multiple occurrences of the same CVE, grouping by CVE is not enough: without a solid "cross tool" identification of a vulnerability, there is no way to ensure that a vuln from tool A matches the same occurrence reported by tool B.
Here is a use case with a Maven multi-module project (but this could also apply to a simple project):
├── pom.xml
├── module_A/
│   └── dep1
└── module_B/
    └── dep2
Say dep1 and dep2 are both affected by a SQL injection vuln and this is spotted by both tool1 and tool2. We then have all these issues generated by the tools:
1) SQL injection in dep1, reported by tool1
2) SQL injection in dep2, reported by tool1
3) SQL injection in dep1, reported by tool2
4) SQL injection in dep2, reported by tool2
After dedupe we expect to have the following merges:
- 1) and 3)
- 2) and 4)
But with current implementation we could also have:
- 1) and 4)
- 2) and 3)
While this is not a big issue today regarding the displayed information, we must fix this before adding more context about each occurrence.
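To make the ambiguity concrete, here is a minimal sketch, not the actual implementation. The `Finding` type, its fields, and the CVE value are hypothetical placeholders. Grouping by CVE alone puts all four issues in a single bucket, so nothing in the key says which tool1 issue pairs with which tool2 issue.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    # Hypothetical, simplified issue model used only for illustration.
    tool: str
    cve: str
    dependency: str


# The four issues from the example above; the CVE id is a placeholder.
findings = [
    Finding("tool1", "CVE-0000-0001", "dep1"),  # 1)
    Finding("tool1", "CVE-0000-0001", "dep2"),  # 2)
    Finding("tool2", "CVE-0000-0001", "dep1"),  # 3)
    Finding("tool2", "CVE-0000-0001", "dep2"),  # 4)
]


def group_by_cve(items):
    """Group findings using the CVE as the only key."""
    groups = defaultdict(list)
    for item in items:
        groups[item.cve].append(item)
    return dict(groups)


groups = group_by_cve(findings)
```

Here `groups` holds one key with four findings. Any pairing inside that bucket, including 1) with 4), is consistent with the key.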
Proposal:
Find a common way to identify a specific occurrence of a vulnerability that works for all overlapping tools.
Could be something like a hash of:
- identifier type + identifier value (we mostly have CVE, but there could be other types like NSP id, OSVDB id, etc.)
- file path where the vuln comes from (this will be the dependency file for vulns in third-party packages)
- line number within the file (multiple occurrences within the same file are possible)
NOTE: This will require improving our issue model with the necessary data and structure.
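The hash described in the proposal could be sketched as follows. This is only an illustration under assumptions: the `Occurrence` type, its field names, the case normalization, the choice of SHA-256, and the sample file paths and CVE id are all hypothetical and not part of the current issue model.

```python
import hashlib
from typing import NamedTuple, Optional


class Occurrence(NamedTuple):
    # Hypothetical fields; the real issue model would need to carry this data.
    identifier_type: str         # e.g. "cve"
    identifier_value: str        # e.g. a CVE id
    file_path: str               # dependency file for third-party packages
    line: Optional[int] = None   # None when a tool reports no line number


def fingerprint(occ: Occurrence) -> str:
    """Return a tool-independent hash for one occurrence of a vulnerability."""
    parts = [
        # Assumption: tools may differ in letter case, so normalize it.
        occ.identifier_type.lower(),
        occ.identifier_value.upper(),
        occ.file_path,
        "" if occ.line is None else str(occ.line),
    ]
    # Join with NUL so that field boundaries stay unambiguous.
    payload = "\0".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same occurrence as seen by two tools, plus a different occurrence.
from_tool1 = Occurrence("cve", "CVE-0000-0001", "module_A/pom.xml", 12)
from_tool2 = Occurrence("CVE", "cve-0000-0001", "module_A/pom.xml", 12)
other = Occurrence("cve", "CVE-0000-0001", "module_B/pom.xml", 12)
```

With this key, `from_tool1` and `from_tool2` get the same fingerprint and can be merged, while `other` stays separate even though it shares the CVE. This only works if every overlapping tool reports the same file path and line for a given occurrence, which is the part that needs validating per tool.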