Skip to content

Improve Vulnerability Occurrence Uniqueness/Tracking

Summary

Vulnerability uniqueness and tracking is a hard topic. This issue is a proposal to change the way we determine and track distinct vulnerabilities so that it is maintainable, extendable, and flexible.

Top-level points:

  1. The location field is a misnomer
  2. The current uniqueness field, location_fingerprint, composed of data from the location field and implemented -> like -> this, should not be analyzer-type specific
  3. We assume the intention of the analyzer by only hashing the resulting location_fingerprint value
  4. Analyzers need a way to declare their intention in how distinct vulnerabilities are tracked or determined to be unique between different versions of code
  5. Specific identifiers (CWE, CVE, etc.) paired with the vulnerability location should be used to identify distinct vulnerabilities.

Background

Questions / Problems

Below are a few hard questions or problems related to dealing with vulnerability uniqueness:

  • How do we track a distinct vulnerability throughout the life of a project, especially considering refactoring, analyzer improvements, and code changes?
  • Which attributes of a vulnerability do we key off of?
  • Can findings from multiple analyzers automatically and reliably be reduced to a single vulnerability occurrence?
  • At which point has the code changed enough that it becomes a new vulnerability (similar to the Ship of Theseus thought experiment)?
  • How do we know the intention of an analyzer in identifying or comparing the uniqueness of two vulnerabilities?

In addition to the problems above, we also face the following problems:

  • Analyzers focus on identifying vulnerabilities, not robustly tracking them throughout the life of an evolving project

Current Implementation

Currently, vulnerability uniqueness is determined by using a hash of data derived from the location field defined in a vulnerability occurrence.

  1. Analyzers place the information they deem critical to identifying uniqueness into the location field - sast schema.
  2. Analogous rails code then extracts specific fields from the location to construct the location_fingerprint value - common.rb -> parsers/../sast.rb -> locations/sast.rb.
  3. A hash of the location_fingerprint value is then used to determine vulnerability uniqueness - models/vulnerabilities/finding.rb.

Below are two examples of the different ways analyzers report the vulnerability location (aka uniqueness):

SAST

security-report-schemas source

{
    "file": "path/to/file",
    "start_line": 100,
    "end_line": 200,
    "class": "a_class",
    "method": "a_method"
}

Dependency Scanning

security-report-schemas source

{
    "file": "path/to/file",
    "dependency": {
        "package": {
            "name": "name_of_package"
        },
        "version": "1.2.3"
    }
}

Coverage-guided Fuzzing

security-report-schemas source

{
    "crash_address": "module+offset",
    "crash_type": "USE_AFTER_FREE",
    "minimized_stack_trace": "..."
}

1. The location Field is a Misnomer

Due to the location field also being used to determine vulnerability uniqueness, two things occur:

  • the location field becomes a misnomer
  • analyzers that determine vulnerability uniqueness from more than the error location are forced to add non-location-specific information into the location field

The coverage-guided fuzzing location object is an example of this - the stack trace and crash type have nothing to do with where the vulnerability occurred and everything to do with determining uniqueness.

A uniqueness- or tracking-specific field (uniqueness or tracking?) should be added to the required fields of a vulnerability occurrence. This would be explicitly used to allow analyzers to declare how a vulnerability can be determined to be distinct from other vulnerabilities, including from previous scans on other versions of code.

2. The Uniqueness Value Should Not be Analyzer-Type Specific

The location field (aka uniqueness field) is strictly typed in the security-report-schemas for each analyzer type. All SAST-type analyzers are required to structure their location field the same way, as are all DAST-type analyzers, etc.

This has the effect of restricting future SAST / DAST/ * analyzers that may wish to determine vulnerability uniqueness differently than how vulnerability locations are declared.

For example, most SAST tools identify vulnerability uniqueness based on the position in the code. What if a graph-based SAST tool used control-flow paths from source -> sink to determine vulnerability uniqueness? It is still a SAST tool, but the uniqueness (location) field would not support the analyzer's intention of using source -> sink paths as a uniqueness metric.

3. We Assume the Intention of the Analyzer

By only hashing a value composed of data from the location field and using that as a uniqueness metric, we are assuming the intention of the analyzer.

A few examples make this easier to talk about:

SAST

Vulnerability occurrences in a SAST report are often identified as unique based on their position in the source code. The intention of specifying a line number/offset/column in a file is to specify a position or statement in the code, not the actual file+line number value. Reformatting the code or inserting lines before the line containing the vulnerability should have no effect on identifying the vulnerability.

To put this another way, the literal file+line values are not the intended uniqueness metric - the data pointed to by the file+line values is the intended uniqueness value.

All analyzers (not just SAST) that have the intention of identifying vulnerability uniqueness based on their position in the source code are having their intent misassumed when their location value is hashed.

Coverage-Guided Fuzzing

All of the coverage-guided fuzzing features haven't landed yet, but it's worth talking about.

Coverage-guided fuzzing determines uniqueness based on the:

  • type of crash
  • a truncated stack trace
  • the error address

The uniqueness intent of the coverage-guided fuzzing analyzer is to operate on distinct sets of these values. Aside from the location field being a misnomer, the current implementation of comparing uniqueness does reflect the intent of coverage-guided fuzzing.

4. Analyzers Need a Way to Declare Uniqueness or Tracking Intent

Instead of using the location field to determine the vulnerability's uniqueness, security reports should have a field specific to declaring the uniqueness intent of the analyzer.

I am proposing that we define a series of typed JSON fields from which analyzers can choose. Perhaps something like this:

Position-based Uniqueness

Analyzers that depend on the position of data within project files to determine uniqueness could use this uniqueness type:

"tracking": {
    "type": "position",
    "file": "path/to/file",
    "line": 10,
    "column": 5,
}

Match-based Uniqueness - An Example of What We Don't Want

NO - I originally had this as a uniqueness type, but have realized that this is a scanning/detection rule, not a uniqueness type. Instead of removing this example, I think it is useful to leave it in as an example of what we should not add as uniqueness type

My example for this was dependency scanning - dependency scanning should continue using hash-based uniqueness based on the file path and package name.

Analyzers that depend on the existence of a string within project files to determine uniqueness could use these types.

"tracking": { "type": "match" "regex": "^\s*[^#]package_name==1.2.3$", "file": "path/to/file", }

Hash-based Uniqueness

Analyzers that wish uniqueness to be determined from a direct hash of an analyzer-provided value could use a hash uniqueness type.

"tracking": {
    "type": "hash",
    "data": "DATA TO HASH", // maybe Object()s too?
}

5. Use Specific Identifiers Paired with Uniqueness Data to Identify Distinct Vulnerabilities

Having a uniqueness value declared by the analyzer helps us identify distinct vulnerabilities. However, it is not the complete story. If we want to be able to deduplicate vulnerabilities from separate analyzers, we will also need to include vulnerability identifiers.

This is roughly how the code currently works, except that CWE and WASC identifiers are ignored. I do not believe this should be the case.

If two vulnerability occurrences, each from different analyzers, resolve to the same location in the code, but have vastly different CWE or WASC types, then we could use that information to conclude that they are not the same vulnerability.

A concept similar to this was used in the sast-benchmark project. The CWE hierarchy was used so that the benchmark code could compare results from multiple SAST tools, even if the same CWE identifier was not reported.

See the cluster_secondary_same? and cluster_primary_same? functions as well as this documentation in the scoring implementation.

Benefits

Requiring analyzers to explicitly define their uniqueness intent has many benefits:

1. Type-Specific vs Analyzer-Specific Uniqueness Metrics

Different analyzers of the same type (for example SAST) may choose to use different uniqueness intents (types). This gives us flexibility by allowing analyzers to choose the uniqueness type that matches their intent.

2. Deduplication of Vulnerability Occurrences

Having the intent of the analyzer clearly identified will help us more accurately track distinct vulnerabilities in an evolving code base.

We will be able to apply a more appropriate tracking or correlation strategy to each uniqueness type.

Additionally, making use of CVE, CWE, and WASC identifiers may help us correlate vulnerability occurences between different analyzers.

3. Vulnerability Tracking Improvements Occur Independently

The tracking of vulnerabilities with specific uniqueness types can be improved independently from analyzer development. With the analyzer explicitly declaring their intent, the uniqueness format declared by the analyzer should rarely change. However, in the backend we may continually improve our handling of specific uniqueness types.

For example, the tracking of a vulnerability occurrence that has a position-based uniqueness value could be improved in an iterative manner:

  • Use a hash of the file/line numbers/column (MVC - also what we're currently doing)

  • Track the diffs between commits to see if the current vulnerability location points to the same data as an older vulnerability occurrence, ignoring any data insertions or deletions that occurred earlier in the file. In other words, track the modifications to the file that occurred in each commit between the two security reports to determine if the new location was simply a result of data being inserted or deleted on previous lines.

  • Parse the source code into an AST and track the position in the AST of the data pointed to by the location. E.g. a function call might be:

    • ---- a statement in a while loop
    • --- in a function body
    • -- of a function named test_function
    • - in a class named TheClass
    • in a file named the_test_class.language
    • .
    • Perhaps this should become a new uniqueness type ast?

Auto-Summary 🤖

Discoto Usage

Points

Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) point:. For example, the following are all valid points:

  • #### POINT: This is a point
  • * point: This is a point
  • + Point: This is a point
  • - pOINT: This is a point
  • point: This is a **point**

Note that any markdown used in the point text will also be propagated into the topic summaries.

Topics

Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.

Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) topic:. For example, the following are all valid topics:

  • # Topic: Inline discussion topic 1
  • ## TOPIC: **{+A Green, bolded topic+}**
  • ### tOpIc: Another topic

Quick Actions

Action Description
/discuss sub-topic TITLE Create an issue for a sub-topic. Does not work in epics
/discuss link ISSUABLE-LINK Link an issuable as a child of this discussion

Last updated by this job

Discoto Settings
---
summary:
  max_items: -1
  sort_by: created
  sort_direction: ascending

See the settings schema for details.

Edited by 🤖 GitLab Bot 🤖