Improve dedupe algorithm

Description:

We currently have a basic dedupe algorithm that avoids showing alerts from multiple tools about the same CVE. This was done in a simple way and doesn't cover all possible use cases.

If a project is affected by multiple occurrences of the same CVE, grouping by CVE is not enough: without a solid "cross-tool" identification of a vulnerability, there is no way to ensure that a vuln reported by tool A matches the same occurrence reported by tool B.

Here is a use case with a Maven multi-module project (but this could also apply to a simple project):

├── pom.xml
├── module_A/
│   ├── dep1
├── module_B/
│   ├── dep2

Say dep1 and dep2 are both affected by a SQL injection vuln and this is spotted by both tool1 and tool2. We then have all these issues generated by the tools:

    1. SQL injection in dep1, reported by tool1
    2. SQL injection in dep2, reported by tool1
    3. SQL injection in dep1, reported by tool2
    4. SQL injection in dep2, reported by tool2

After dedupe we expect to have the following merges:

    1) and 3)
    2) and 4)

But with the current implementation we could also have:

    1) and 4)
    2) and 3)

While this is not a big issue today regarding the displayed information, we must fix this before adding more context about each occurrence.
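
The ambiguity can be reproduced in a short Python sketch. This is not the actual implementation: the `FINDINGS` records, the `group_by` helper and the placeholder CVE id are all assumptions made for illustration. The ids follow the order in which the four issues are listed above.

```python
from collections import defaultdict

# Hypothetical findings mirroring the example above; the CVE id is a placeholder.
FINDINGS = [
    {"id": 1, "tool": "tool1", "cve": "CVE-0000-0000", "dependency": "dep1"},
    {"id": 2, "tool": "tool1", "cve": "CVE-0000-0000", "dependency": "dep2"},
    {"id": 3, "tool": "tool2", "cve": "CVE-0000-0000", "dependency": "dep1"},
    {"id": 4, "tool": "tool2", "cve": "CVE-0000-0000", "dependency": "dep2"},
]


def group_by(findings, key_fields):
    """Group finding ids by the tuple of values found in key_fields."""
    groups = defaultdict(list)
    for finding in findings:
        key = tuple(finding[field] for field in key_fields)
        groups[key].append(finding["id"])
    return dict(groups)


# Grouping by CVE alone puts all four findings in one bucket, so nothing
# says which tool1 finding should be paired with which tool2 finding.
by_cve = group_by(FINDINGS, ["cve"])

# Adding a per-occurrence discriminator separates the two occurrences.
by_occurrence = group_by(FINDINGS, ["cve", "dependency"])

print(by_cve)
print(by_occurrence)
```

With the CVE as the only key, every pairing inside the single bucket is equally valid. Once the key also identifies the occurrence, the findings fall into the two expected pairs.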

Proposal:

Find a common way to identify a specific occurrence of a vulnerability that works for all overlapping tools.

Could be something like a hash of:

  • identifier type + identifier value (we mostly have CVE, but there could be other types such as NSP id, OSVDB id, etc.)
  • file path where the vuln comes from (the dependency file for vulns in 3rd-party packages)
  • line number within the file (multiple occurrences within the same file are possible)
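
A minimal Python sketch of such a hash, under stated assumptions: the `Occurrence` class, the `fingerprint` function, the normalization rules, the choice of SHA-256 and the example file paths are all hypothetical and not part of the existing issue model.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical cross-tool description of one vulnerability occurrence."""
    identifier_type: str         # e.g. "cve"
    identifier_value: str        # e.g. "CVE-0000-0000" (placeholder)
    file_path: str               # dependency file for 3rd-party packages
    line: Optional[int] = None   # None when the tool reports no line


def fingerprint(occ: Occurrence) -> str:
    """Return a hex digest that is stable across tools for the same occurrence."""
    parts = [
        # Normalize case so that "CVE" and "cve" hash identically.
        occ.identifier_type.strip().lower(),
        occ.identifier_value.strip().upper(),
        # Normalize path separators so Windows-style paths match.
        occ.file_path.strip().replace("\\", "/"),
        "" if occ.line is None else str(occ.line),
    ]
    # Join with NUL so field boundaries stay unambiguous.
    payload = "\0".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same occurrence as seen by two tools with different casing conventions.
a = Occurrence("CVE", "cve-0000-0000", "module_A/pom.xml", 12)
b = Occurrence("cve", "CVE-0000-0000", "module_A/pom.xml", 12)
# Same CVE, but a different occurrence in another module.
c = Occurrence("cve", "CVE-0000-0000", "module_B/pom.xml", 12)

print(fingerprint(a) == fingerprint(b))  # True
print(fingerprint(a) != fingerprint(c))  # True
```

Issues could then be merged by grouping on the fingerprint instead of the CVE alone. This only works if all overlapping tools report the same file path and line for a given occurrence, which would need to be checked per tool.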

NOTE: This will require extending our issue model with the necessary data and structure.