Improve Vulnerability Occurrence Uniqueness/Tracking
## Summary
Vulnerability uniqueness and tracking is a hard topic. This issue is a proposal to change
the way we determine and track distinct vulnerabilities so that it is maintainable,
extendable, and flexible.
Top-level points:
1. The `location` field is a misnomer
2. The current uniqueness value, `location_fingerprint`, is composed of data from the `location` field
([implemented](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/common.rb#L63) ->
[like](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/sast.rb#L14-21) ->
[this](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/sast.rb#L25-27))
and should not be analyzer-type specific
3. We assume the intention of the analyzer by only hashing the resulting `location_fingerprint` value
4. Analyzers need a way to declare their intention in how distinct vulnerabilities are tracked or determined to be unique between different versions of code
5. Specific identifiers (CWE, CVE, etc.) paired with the vulnerability location
*should* be used to identify distinct vulnerabilities.
## Background
### Questions / Problems
Below are a few hard questions or problems related to dealing with vulnerability uniqueness:
* How do we track a distinct vulnerability throughout the life of a project, especially considering refactoring, analyzer improvements, and code changes?
* Which attributes of a vulnerability do we key off of?
* Can findings from multiple analyzers automatically and reliably be reduced to a single vulnerability occurrence?
* At which point has the code changed enough that it becomes a new vulnerability (similar to the [Ship of Theseus thought experiment](https://en.wikipedia.org/wiki/Ship_of_Theseus))?
* How do we know the intention of an analyzer in identifying or comparing the uniqueness of two vulnerabilities?
In addition to the questions above, we also face a more fundamental problem:
* Analyzers focus on *identifying* vulnerabilities, not robustly tracking them throughout the life of an evolving project
### Current Implementation
Currently, vulnerability uniqueness is determined by using a hash
of data derived from the `location` field defined in a vulnerability
occurrence.
1. Analyzers place the information they deem critical to identifying
uniqueness into the `location` field - [sast schema](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/sast-report-format.json).
2. Analogous rails code then extracts specific fields from the location to construct the `location_fingerprint` value -
[common.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/common.rb#L63) ->
[parsers/../sast.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/sast.rb#L14-21) ->
[locations/sast.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/sast.rb#L25-27).
3. A hash of the `location_fingerprint` value is then used to determine vulnerability uniqueness -
[models/vulnerabilities/finding.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/576fea813d60c00ce483738e18cb417a4e0c512e/ee/app/models/vulnerabilities/finding.rb#L294-296).
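As a rough sketch of steps 1-3 above, the SAST flow condenses to the following (the method name here is hypothetical, and the exact field choice and hash algorithm are assumptions based on the linked code):

```ruby
require 'digest'

# Hypothetical condensation of the linked parser/location/finding code:
# selected fields are pulled out of the `location` object, joined into a
# fingerprint string, and hashed to produce the uniqueness value.
def location_fingerprint(file:, start_line:, end_line:)
  fingerprint_data = "#{file}:#{start_line}:#{end_line}"
  Digest::SHA1.hexdigest(fingerprint_data)
end

fp = location_fingerprint(file: 'path/to/file', start_line: 100, end_line: 200)
```

Note that, in the linked SAST code, only the file path and line range appear to feed the fingerprint; the other location fields (`class`, `method`) are discarded before hashing.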
Below are three examples of the different
ways analyzers report the vulnerability location (aka uniqueness):
#### SAST
[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/sast-report-format.json)
```json
{
"file": "path/to/file",
"start_line": 100,
"end_line": 200,
"class": "a_class",
"method": "a_method"
}
```
#### Dependency Scanning
[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/dependency-scanning-report-format.json)
```json
{
"file": "path/to/file",
"dependency": {
"package": {
"name": "name_of_package"
},
"version": "1.2.3"
}
}
```
#### Coverage-guided Fuzzing
[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/723e4cf4b35f6030ff97d09ff4e7df882e2310cd/src/fuzz-coverage-report-format.json)
```json
{
"crash_address": "module+offset",
"crash_type": "USE_AFTER_FREE",
"minimized_stack_trace": "..."
}
```
## 1. The `location` Field is a Misnomer
Due to the `location` field also being used to determine vulnerability uniqueness, two things occur:
* the `location` field becomes a misnomer
* analyzers that determine vulnerability uniqueness from more than the
error location are forced to add non-location-specific information into the
`location` field
The coverage-guided fuzzing location object is an example of this - the
stack trace and crash type have nothing to do with where the vulnerability
occurred and everything to do with determining uniqueness.
A uniqueness- or tracking-specific field (`uniqueness` or `tracking`?) should be added to the required
fields of a vulnerability occurrence. This would be explicitly used to allow
analyzers to declare how a vulnerability can be determined to be distinct from
other vulnerabilities, including from previous scans on other versions of code.
## 2. The Uniqueness Value Should Not be Analyzer-Type Specific
The `location` field (aka uniqueness field) is strictly typed in the
security-report-schemas for each analyzer type. All SAST-type analyzers are
required to structure their location field the same way, as are all DAST-type
analyzers, etc.
This restricts future SAST/DAST/etc. analyzers that may
wish to determine vulnerability uniqueness differently from how vulnerability
locations are declared.
For example, most SAST tools identify vulnerability uniqueness based on the
position in the code. What if a graph-based SAST tool used control-flow
paths from `source -> sink` to determine vulnerability uniqueness? It is
still a SAST tool, but the uniqueness (`location`) field would not support
the analyzer's intention of using `source -> sink` paths as a uniqueness
metric.
## 3. We Assume the Intention of the Analyzer
By only hashing a value composed of data from the `location` field and
[using that as a uniqueness metric](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/app/models/vulnerabilities/finding.rb#L287-296),
we are assuming the intention of the analyzer.
A few examples make this easier to talk about:
**SAST**
Vulnerability occurrences in a SAST report are often identified as unique
based on their position in the source code. The intention of specifying a
line number/offset/column in a file is to specify a position or statement
in the code, not the actual file+line number value. Reformatting the code
or inserting lines before the line containing the vulnerability should have
no effect on identifying the vulnerability.
To put this another way, the literal `file+line` values are not the intended
uniqueness metric - the data pointed to by the `file+line` values is the
intended uniqueness value.
All analyzers (not just SAST) that intend to identify vulnerability
uniqueness by position in the source code have their intent
misinterpreted when their location value is hashed.
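To make the mismatch concrete, here is a toy illustration (the `fingerprint` lambda is hypothetical, mirroring the hash-of-location approach): inserting a single unrelated line above the vulnerable statement changes the hash, so the occurrence is treated as brand new.

```ruby
require 'digest'

# Hash-of-location approach: fingerprint is derived from the literal
# file + line values, not from the data those values point to.
fingerprint = ->(file, line) { Digest::SHA1.hexdigest("#{file}:#{line}") }

before = fingerprint.call('app.rb', 10) # vulnerable call on line 10
after  = fingerprint.call('app.rb', 11) # same call, pushed down by one inserted line

before == after # => false: same statement, "new" vulnerability
```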
**Coverage-Guided Fuzzing**
The coverage-guided fuzzing features haven't all landed yet, but they are
worth discussing.
Coverage-guided [fuzzing determines uniqueness](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/merge_requests/42#note_383256594)
based on:
* the crash type
* a truncated stack trace
* the error address
The uniqueness intent of the coverage-guided fuzzing analyzer *is* to operate
on distinct sets of these values. Aside from the `location` field being
a misnomer, the current implementation of comparing uniqueness *does* reflect
the intent of coverage-guided fuzzing.
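In other words, for fuzzing a direct hash over the crash attributes *is* the right operation. A minimal sketch, using the field names from the fuzzing location example above (the helper itself is hypothetical):

```ruby
require 'digest'

# For fuzzing, the literal attribute values themselves are the uniqueness
# metric, so hashing them directly matches the analyzer's intent.
def crash_fingerprint(crash_type:, minimized_stack_trace:, crash_address:)
  Digest::SHA1.hexdigest([crash_type, minimized_stack_trace, crash_address].join("\n"))
end
```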
## 4. Analyzers Need a Way to Declare Uniqueness or Tracking Intent
Instead of using the `location` field to determine the vulnerability's
uniqueness, security reports should have a field specific to declaring the
uniqueness intent of the analyzer.
I am proposing that we define a series of typed JSON fields from which analyzers
can choose. Perhaps something like this:
**Position-based Uniqueness**
Analyzers that depend on the position of data within project files to determine
uniqueness could use this uniqueness type:
```json
"tracking": {
"type": "position",
"file": "path/to/file",
"line": 10,
"column": 5,
}
```
**Match-based Uniqueness - An Example of What We Don't Want**
*NO* - I originally had this as a uniqueness type, but I have since realized that it is a scanning/detection rule, not a uniqueness type. Instead of removing this example, I am leaving it in as an example of what we should *not* add as a uniqueness type.
My example for this was dependency scanning - dependency scanning should [continue using hash-based uniqueness based on the file path and package name](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/dependency_scanning.rb#L21-23).
> Analyzers that depend on the existence of a string within project files to
> determine uniqueness could use these types.
>
> ```json
> "tracking": {
>   "type": "match",
>   "regex": "^\\s*[^#]package_name==1.2.3$",
>   "file": "path/to/file"
> }
> ```
**Hash-based Uniqueness**
Analyzers that wish uniqueness to be determined from a direct hash of
an analyzer-provided value could use a `hash` uniqueness type.
```js
"tracking": {
"type": "hash",
"data": "DATA TO HASH", // maybe Object()s too?
}
```
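If such a `tracking` field existed, the backend could dispatch on its `type` rather than on the analyzer category. A minimal sketch, assuming the field names from the examples above (the dispatch function itself is hypothetical):

```ruby
require 'digest'
require 'json'

# Hypothetical backend dispatch: each tracking type gets its own
# fingerprinting strategy instead of one location-derived hash.
def tracking_fingerprint(tracking)
  case tracking['type']
  when 'position'
    Digest::SHA1.hexdigest(tracking.values_at('file', 'line', 'column').join(':'))
  when 'hash'
    Digest::SHA1.hexdigest(tracking['data'])
  else
    raise ArgumentError, "unknown tracking type: #{tracking['type']}"
  end
end

position = JSON.parse('{"type": "position", "file": "path/to/file", "line": 10, "column": 5}')
tracking_fingerprint(position)
```

New uniqueness types could then be added by extending the dispatch, without touching existing analyzers.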
## 5. Use Specific Identifiers Paired with Uniqueness Data to Identify Distinct Vulnerabilities
Having a uniqueness value declared by the analyzer helps us identify distinct
vulnerabilities. However, it is not the complete story. If we want to be able
to deduplicate vulnerabilities from separate analyzers, we will also need to
include vulnerability identifiers.
This is roughly [how the code currently works](https://gitlab.com/gitlab-org/gitlab/-/blob/576fea813d60c00ce483738e18cb417a4e0c512e/ee/app/services/security/merge_reports_service.rb#L85-108),
except that CWE and WASC identifiers are ignored. I do not believe this should
be the case.
If two vulnerability occurrences, each from different analyzers, resolve to the
same location in the code, but have vastly different CWE or WASC types, then we
could use that information to conclude that they are not the same vulnerability.
A concept similar to this was used in the [sast-benchmark project](https://gitlab.com/gitlab-org/security-products/sast-benchmark).
The CWE hierarchy was used so that the benchmark code could compare results
from multiple SAST tools, even if the same CWE identifier was not reported.
See the [`cluster_secondary_same?` and `cluster_primary_same?`](https://gitlab.com/gitlab-org/security-products/sast-benchmark/-/blob/de9031039509770940c7ffc2e9d9c6b4b023ac9f/lib/cwe_info.rb#L51-75)
functions as well as [this documentation in the scoring implementation](https://gitlab.com/gitlab-org/security-products/sast-benchmark/-/blob/de9031039509770940c7ffc2e9d9c6b4b023ac9f/lib/scoring/gitlab.rb#L91-135).
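A deduplication check that pairs the uniqueness value with identifiers might look like the following sketch (the occurrence structure and helper are hypothetical, not the `merge_reports_service` code):

```ruby
# Two occurrences merge only if their uniqueness values match AND their
# identifier sets overlap; identical locations with disjoint CWE
# identifiers stay separate findings.
def same_vulnerability?(a, b)
  return false unless a[:fingerprint] == b[:fingerprint]

  (a[:cwe_ids] & b[:cwe_ids]).any?
end

xss  = { fingerprint: 'abc123', cwe_ids: [79] }
sqli = { fingerprint: 'abc123', cwe_ids: [89] }
same_vulnerability?(xss, sqli) # => false: same position, different weakness
```

A fuller version would compare up and down the CWE hierarchy, as the sast-benchmark code does, rather than requiring an exact identifier match.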
## Benefits
Requiring analyzers to explicitly define their uniqueness intent has many
benefits:
### 1. Type-Specific vs Analyzer-Specific Uniqueness Metrics
Different analyzers of the same type (for example SAST) may choose
to use different uniqueness intents (types). This gives us flexibility by
allowing analyzers to choose the uniqueness type that matches their intent.
### 2. Deduplication of Vulnerability Occurrences
Having the intent of the analyzer clearly identified will help us more
accurately track distinct vulnerabilities in an evolving code base.
We will be able to apply a more appropriate tracking or correlation strategy to
each uniqueness type.
Additionally, making use of CVE, CWE, and WASC identifiers may help us
correlate vulnerability occurrences between different analyzers.
### 3. Vulnerability Tracking Improvements Occur Independently
The tracking of vulnerabilities with specific uniqueness types can be improved
independently from analyzer development. With each analyzer explicitly declaring
its intent, the uniqueness format it declares should rarely
change. However, in the backend we may continually improve our handling of
specific uniqueness types.
For example, the tracking of a vulnerability occurrence that has a
position-based uniqueness value could be improved in an iterative manner:
* Use a hash of the file/line numbers/column (MVC - also what we're currently doing)
* Track the diffs between commits to see if the current vulnerability location
points to the same data as an older vulnerability occurrence, ignoring
any data insertions or deletions that occurred earlier in the file. In
other words, track the modifications to the file that occurred in each
commit between the two security reports to determine if the new location
was simply a result of data being inserted or deleted on previous lines.
* Parse the source code into an AST and track the position in the AST of the
  data pointed to by the location. E.g. a function call might be:
  * a statement in a while loop,
  * in a function body,
  * of a function named `test_function`,
  * in a class named `TheClass`,
  * in a file named `the_test_class.language`.
* Perhaps this should become a new uniqueness type, `ast`?
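The second (diff-tracking) improvement above can be sketched as a line-remapping exercise. A toy version, assuming a much-simplified hunk representation (real diff tracking would be considerably more involved):

```ruby
# Remap an old line number forward through a list of insertions, where
# each insertion is [line_at_which_lines_were_inserted, count]. If lines
# were inserted at or above the statement, its line number shifts down.
def remap_line(old_line, insertions)
  insertions.reduce(old_line) do |line, (at, count)|
    at <= line ? line + count : line
  end
end

remap_line(10, [[3, 2]]) # => 12: two lines inserted above line 10
remap_line(2,  [[3, 2]]) # => 2: insertion below the statement, no shift
```

If the remapped old location equals the new report's location, the occurrence can be treated as the same vulnerability rather than a new one.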