Improve Vulnerability Occurrence Uniqueness/Tracking
Summary
Vulnerability uniqueness and tracking is a hard topic. This issue is a proposal to change the way we determine and track distinct vulnerabilities so that it is maintainable, extendable, and flexible.
Top-level points:
- The
location
field is a misnomer - The current uniqueness field,
location_fingerprint
, composed of data from thelocation
field and implemented -> like -> this, should not be analyzer-type specific - We assume the intention of the analyzer by only hashing the resulting
location_fingerprint
value - Analyzers need a way to declare their intention in how distinct vulnerabilities are tracked or determined to be unique between different versions of code
- Specific identifiers (CWE, CVE, etc.) paired with the vulnerability location should be used to identify distinct vulnerabilities.
Background
Questions / Problems
Below are a few hard questions or problems related to dealing with vulnerability uniqueness:
- How do we track a distinct vulnerability throughout the life of a project, especially considering refactoring, analyzer improvements, and code changes?
- Which attributes of a vulnerability do we key off of?
- Can findings from multiple analyzers automatically and reliably be reduced to a single vulnerability occurrence?
- At which point has the code changed enough that it becomes a new vulnerability (similar to the Ship of Theseus thought experiment)?
- How do we know the intention of an analyzer in identifying or comparing the uniqueness of two vulnerabilities?
In addition to the problems above, we also face the following problems:
- Analyzers focus on identifying vulnerabilities, not robustly tracking them throughout the life of an evolving project
Current Implementation
Currently, vulnerability uniqueness is determined by using a hash
of data derived from the location
field defined in a vulnerability
occurrence.
- Analyzers place the information they deem critical to identifying
uniqueness into the
location
field - sast schema. - Analogous rails code then extracts specific fields from the location to construct the
location_fingerprint
value - common.rb -> parsers/../sast.rb -> locations/sast.rb. - A hash of the
location_fingerprint
value is then used to determine vulnerability uniqueness - models/vulnerabilities/finding.rb.
Below are two examples of the different ways analyzers report the vulnerability location (aka uniqueness):
SAST
security-report-schemas source
{
"file": "path/to/file",
"start_line": 100,
"end_line": 200,
"class": "a_class",
"method": "a_method"
}
Dependency Scanning
security-report-schemas source
{
"file": "path/to/file",
"dependency": {
"package": {
"name": "name_of_package"
},
"version": "1.2.3"
}
}
Coverage-guided Fuzzing
security-report-schemas source
{
"crash_address": "module+offset",
"crash_type": "USE_AFTER_FREE",
"minimized_stack_trace": "..."
}
location
Field is a Misnomer
1. The Due to the location
field also being used to determine vulnerability uniqueness, two things occur:
- the
location
field becomes a misnomer - analyzers that determine vulnerability uniqueness from more than the
error location are forced to add non-location-specific information into the
location
field
The coverage-guided fuzzing location object is an example of this - the stack trace and crash type have nothing to do with where the vulnerability occurred and everything to do with determining uniqueness.
A uniqueness- or tracking-specific field (uniqueness
or tracking
?) should be added to the required
fields of a vulnerability occurrence. This would be explicitly used to allow
analyzers to declare how a vulnerability can be determined to be distinct from
other vulnerabilities, including from previous scans on other versions of code.
2. The Uniqueness Value Should Not be Analyzer-Type Specific
The location
field (aka uniqueness field) is strictly typed in the
security-report-schemas for each analyzer type. All SAST-type analyzers are
required to structure their location field the same way, as are all DAST-type
analyzers, etc.
This has the effect of restricting future SAST / DAST/ * analyzers that may wish to determine vulnerability uniqueness differently than how vulnerability locations are declared.
For example, most SAST tools identify vulnerability uniqueness based on the
position in the code. What if a graph-based SAST tool used control-flow
paths from source -> sink
to determine vulnerability uniqueness? It is
still a SAST tool, but the uniqueness (location
) field would not support
the analyzer's intention of using source -> sink
paths as a uniqueness
metric.
3. We Assume the Intention of the Analyzer
By only hashing a value composed of data from the location
field and
using that as a uniqueness metric,
we are assuming the intention of the analyzer.
A few examples make this easier to talk about:
SAST
Vulnerability occurrences in a SAST report are often identified as unique based on their position in the source code. The intention of specifying a line number/offset/column in a file is to specify a position or statement in the code, not the actual file+line number value. Reformatting the code or inserting lines before the line containing the vulnerability should have no effect on identifying the vulnerability.
To put this another way, the literal file+line
values are not the intended
uniqueness metric - the data pointed to by the file+line
values is the
intended uniqueness value.
All analyzers (not just SAST) that have the intention of identifying vulnerability uniqueness based on their position in the source code are having their intent misassumed when their location value is hashed.
Coverage-Guided Fuzzing
All of the coverage-guided fuzzing features haven't landed yet, but it's worth talking about.
Coverage-guided fuzzing determines uniqueness based on the:
- type of crash
- a truncated stack trace
- the error address
The uniqueness intent of the coverage-guided fuzzing analyzer is to operate
on distinct sets of these values. Aside from the location
field being
a misnomer, the current implementation of comparing uniqueness does reflect
the intent of coverage-guided fuzzing.
4. Analyzers Need a Way to Declare Uniqueness or Tracking Intent
Instead of using the location
field to determine the vulnerability's
uniqueness, security reports should have a field specific to declaring the
uniqueness intent of the analyzer.
I am proposing that we define a series of typed JSON fields from which analyzers can choose. Perhaps something like this:
Position-based Uniqueness
Analyzers that depend on the position of data within project files to determine uniqueness could use this uniqueness type:
"tracking": {
"type": "position",
"file": "path/to/file",
"line": 10,
"column": 5,
}
Match-based Uniqueness - An Example of What We Don't Want
NO - I originally had this as a uniqueness type, but have realized that this is a scanning/detection rule, not a uniqueness type. Instead of removing this example, I think it is useful to leave it in as an example of what we should not add as uniqueness type
My example for this was dependency scanning - dependency scanning should continue using hash-based uniqueness based on the file path and package name.
Analyzers that depend on the existence of a string within project files to determine uniqueness could use these types.
"tracking": { "type": "match" "regex": "^\s*[^#]package_name==1.2.3$", "file": "path/to/file", }
Hash-based Uniqueness
Analyzers that wish uniqueness to be determined from a direct hash of
an analyzer-provided value could use a hash
uniqueness type.
"tracking": {
"type": "hash",
"data": "DATA TO HASH", // maybe Object()s too?
}
5. Use Specific Identifiers Paired with Uniqueness Data to Identify Distinct Vulnerabilities
Having a uniqueness value declared by the analyzer helps us identify distinct vulnerabilities. However, it is not the complete story. If we want to be able to deduplicate vulnerabilities from separate analyzers, we will also need to include vulnerability identifiers.
This is roughly how the code currently works, except that CWE and WASC identifiers are ignored. I do not believe this should be the case.
If two vulnerability occurrences, each from different analyzers, resolve to the same location in the code, but have vastly different CWE or WASC types, then we could use that information to conclude that they are not the same vulnerability.
A concept similar to this was used in the sast-benchmark project. The CWE hierarchy was used so that the benchmark code could compare results from multiple SAST tools, even if the same CWE identifier was not reported.
See the cluster_secondary_same?
and cluster_primary_same?
functions as well as this documentation in the scoring implementation.
Benefits
Requiring analyzers to explicitly define their uniqueness intent has many benefits:
1. Type-Specific vs Analyzer-Specific Uniqueness Metrics
Different analyzers of the same type (for example SAST) may choose to use different uniqueness intents (types). This gives us flexibility by allowing analyzers to choose the uniqueness type that matches their intent.
2. Deduplication of Vulnerability Occurrences
Having the intent of the analyzer clearly identified will help us more accurately track distinct vulnerabilities in an evolving code base.
We will be able to apply a more appropriate tracking or correlation strategy to each uniqueness type.
Additionally, making use of CVE, CWE, and WASC identifiers may help us correlate vulnerability occurences between different analyzers.
3. Vulnerability Tracking Improvements Occur Independently
The tracking of vulnerabilities with specific uniqueness types can be improved independently from analyzer development. With the analyzer explicitly declaring their intent, the uniqueness format declared by the analyzer should rarely change. However, in the backend we may continually improve our handling of specific uniqueness types.
For example, the tracking of a vulnerability occurrence that has a position-based uniqueness value could be improved in an iterative manner:
-
Use a hash of the file/line numbers/column (MVC - also what we're currently doing)
-
Track the diffs between commits to see if the current vulnerability location points to the same data as an older vulnerability occurrence, ignoring any data insertions or deletions that occurred earlier in the file. In other words, track the modifications to the file that occurred in each commit between the two security reports to determine if the new location was simply a result of data being inserted or deleted on previous lines.
-
Parse the source code into an AST and track the position in the AST of the data pointed to by the location. E.g. a function call might be:
-
----
a statement in a while loop -
---
in a function body -
--
of a function namedtest_function
-
-
in a class namedTheClass
- in a file named
the_test_class.language
- .
- Perhaps this should become a new uniqueness type
ast
?
-
🤖
Auto-Summary Discoto Usage
Points
Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive)
point:
. For example, the following are all valid points:
#### POINT: This is a point
* point: This is a point
+ Point: This is a point
- pOINT: This is a point
point: This is a **point**
Note that any markdown used in the point text will also be propagated into the topic summaries.
Topics
Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.
Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive)
topic:
. For example, the following are all valid topics:
# Topic: Inline discussion topic 1
## TOPIC: **{+A Green, bolded topic+}**
### tOpIc: Another topic
Quick Actions
Action Description /discuss sub-topic TITLE
Create an issue for a sub-topic. Does not work in epics /discuss link ISSUABLE-LINK
Link an issuable as a child of this discussion
Last updated by this job
-
⏫ ROOT Improve Vulnerability Tracking &4612 (closed)
Discoto Settings
---
summary:
max_items: -1
sort_by: created
sort_direction: ascending
See the settings schema for details.