# Improve Vulnerability Occurrence Uniqueness/Tracking
## Summary

Vulnerability uniqueness and tracking are hard topics. This issue is a proposal to change the way we determine and track distinct vulnerabilities so that it is maintainable, extendable, and flexible.

Top-level points:

1. The `location` field is a misnomer
2. The current uniqueness field, `location_fingerprint`, composed of data from the `location` field and [implemented](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/common.rb#L63) -> [like](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/sast.rb#L14-21) -> [this](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/sast.rb#L25-27), should not be analyzer-type specific
3. We assume the intention of the analyzer by only hashing the resulting `location_fingerprint` value
4. Analyzers need a way to declare their intention in how distinct vulnerabilities are tracked or determined to be unique between different versions of code
5. Specific identifiers (CWE, CVE, etc.) paired with the vulnerability location *should* be used to identify distinct vulnerabilities

## Background

### Questions / Problems

Below are a few hard questions or problems related to dealing with vulnerability uniqueness:

* How do we track a distinct vulnerability throughout the life of a project, especially considering refactoring, analyzer improvements, and code changes?
* Which attributes of a vulnerability do we key off of?
* Can findings from multiple analyzers automatically and reliably be reduced to a single vulnerability occurrence?
* At which point has the code changed enough that it becomes a new vulnerability (similar to the [Ship of Theseus thought experiment](https://en.wikipedia.org/wiki/Ship_of_Theseus))?
* How do we know the intention of an analyzer in identifying or comparing the uniqueness of two vulnerabilities?

In addition to the questions above, we also face the following problem:

* Analyzers focus on *identifying* vulnerabilities, not robustly tracking them throughout the life of an evolving project

### Current Implementation

Currently, vulnerability uniqueness is determined by hashing data derived from the `location` field of a vulnerability occurrence:

1. Analyzers place the information they deem critical to identifying uniqueness into the `location` field - [sast schema](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/sast-report-format.json).
2. Analogous Rails code then extracts specific fields from the location to construct the `location_fingerprint` value - [common.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/common.rb#L63) -> [parsers/../sast.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/parsers/security/sast.rb#L14-21) -> [locations/sast.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/sast.rb#L25-27).
3. A hash of the `location_fingerprint` value is then used to determine vulnerability uniqueness - [models/vulnerabilities/finding.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/576fea813d60c00ce483738e18cb417a4e0c512e/ee/app/models/vulnerabilities/finding.rb#L294-296).
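The flow above can be sketched roughly as follows. This is a simplified illustration of the linked Rails code, not the actual implementation; the class and method names here are hypothetical stand-ins:

```ruby
require 'digest'

# Simplified sketch of the current SAST fingerprinting flow:
# positional fields are pulled out of `location` and hashed directly.
class SastLocation
  def initialize(file_path:, start_line:, end_line:)
    @file_path = file_path
    @start_line = start_line
    @end_line = end_line
  end

  # The fingerprint is built purely from the literal positional values.
  def fingerprint_data
    "#{@file_path}:#{@start_line}:#{@end_line}"
  end

  # A hash of the fingerprint data is what uniqueness checks compare.
  def fingerprint
    Digest::SHA1.hexdigest(fingerprint_data)
  end
end

loc = SastLocation.new(file_path: 'app/models/user.rb', start_line: 100, end_line: 200)
puts loc.fingerprint
```

Note that nothing in this flow captures *why* the analyzer considers those fields meaningful - only the literal values survive into the fingerprint.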
Below are three examples of the different ways analyzers report the vulnerability location (aka uniqueness):

#### SAST

[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/sast-report-format.json)

```json
{
  "file": "path/to/file",
  "start_line": 100,
  "end_line": 200,
  "class": "a_class",
  "method": "a_method"
}
```

#### Dependency Scanning

[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/fde94b0c22c107911d9f46ff0f6c06e3464c621c/src/dependency-scanning-report-format.json)

```json
{
  "file": "path/to/file",
  "dependency": {
    "package": {
      "name": "name_of_package"
    },
    "version": "1.2.3"
  }
}
```

#### Coverage-guided Fuzzing

[security-report-schemas source](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/723e4cf4b35f6030ff97d09ff4e7df882e2310cd/src/fuzz-coverage-report-format.json)

```json
{
  "crash_address": "module+offset",
  "crash_type": "USE_AFTER_FREE",
  "minimized_stack_trace": "..."
}
```

## 1. The `location` Field is a Misnomer

Due to the `location` field also being used to determine vulnerability uniqueness, two things occur:

* the `location` field becomes a misnomer
* analyzers that determine vulnerability uniqueness from more than the error location are forced to add non-location-specific information into the `location` field

The coverage-guided fuzzing location object is an example of this - the stack trace and crash type have nothing to do with where the vulnerability occurred and everything to do with determining uniqueness.

A uniqueness- or tracking-specific field (`uniqueness` or `tracking`?) should be added to the required fields of a vulnerability occurrence. This field would explicitly allow analyzers to declare how a vulnerability can be determined to be distinct from other vulnerabilities, including from previous scans on other versions of the code.

## 2. The Uniqueness Value Should Not Be Analyzer-Type Specific

The `location` field (aka uniqueness field) is strictly typed in the security-report-schemas for each analyzer type. All SAST-type analyzers are required to structure their `location` field the same way, as are all DAST-type analyzers, and so on. This restricts future SAST, DAST, or other analyzers that may wish to determine vulnerability uniqueness differently than how vulnerability locations are declared.

For example, most SAST tools identify vulnerability uniqueness based on the position in the code. What if a graph-based SAST tool used control-flow paths from `source -> sink` to determine vulnerability uniqueness? It is still a SAST tool, but the uniqueness (`location`) field would not support the analyzer's intention of using `source -> sink` paths as a uniqueness metric.

## 3. We Assume the Intention of the Analyzer

By only hashing a value composed of data from the `location` field and [using that as a uniqueness metric](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/app/models/vulnerabilities/finding.rb#L287-296), we are assuming the intention of the analyzer. A few examples make this easier to talk about:

**SAST**

Vulnerability occurrences in a SAST report are often identified as unique based on their position in the source code. The intention of specifying a line number/offset/column in a file is to specify a position or statement in the code, not the literal file+line number value. Reformatting the code or inserting lines before the line containing the vulnerability should have no effect on identifying the vulnerability.

To put this another way, the literal `file+line` values are not the intended uniqueness metric - the data pointed to by the `file+line` values is the intended uniqueness value.
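A tiny illustration of the mismatch (file names and line numbers are made up; the hashing only loosely mirrors the linked `finding.rb` logic): inserting a single unrelated line above the vulnerable statement changes the literal line number, so a hash over literal values changes even though the vulnerable code itself is untouched.

```ruby
require 'digest'

# Hash-based fingerprint over literal file+line values, as today.
def position_fingerprint(file, line)
  Digest::SHA1.hexdigest("#{file}:#{line}")
end

before = position_fingerprint('app/api.rb', 42)

# An unrelated comment is added at the top of the file; the vulnerable
# statement merely shifts down one line.
after = position_fingerprint('app/api.rb', 43)

# Same vulnerability, different fingerprints -> tracked as a "new" finding.
puts before == after  # false
```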
All analyzers (not just SAST) that intend to identify vulnerability uniqueness based on position in the source code have their intent misassumed when their location value is hashed.

**Coverage-Guided Fuzzing**

All of the coverage-guided fuzzing features haven't landed yet, but it's worth talking about. Coverage-guided [fuzzing determines uniqueness](https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/merge_requests/42#note_383256594) based on:

* the type of crash
* a truncated stack trace
* the error address

The uniqueness intent of the coverage-guided fuzzing analyzer *is* to operate on distinct sets of these values. Aside from the `location` field being a misnomer, the current implementation of comparing uniqueness *does* reflect the intent of coverage-guided fuzzing.

## 4. Analyzers Need a Way to Declare Uniqueness or Tracking Intent

Instead of using the `location` field to determine the vulnerability's uniqueness, security reports should have a field specific to declaring the uniqueness intent of the analyzer. I am proposing that we define a series of typed JSON fields from which analyzers can choose. Perhaps something like this:

**Position-based Uniqueness**

Analyzers that depend on the position of data within project files to determine uniqueness could use this uniqueness type:

```json
"tracking": {
  "type": "position",
  "file": "path/to/file",
  "line": 10,
  "column": 5
}
```

**Match-based Uniqueness - An Example of What We Don't Want**

*NO* - I originally had this as a uniqueness type, but have realized that this is a scanning/detection rule, not a uniqueness type.
Instead of removing this example, I think it is useful to leave it in as an example of what we should not add as a uniqueness type. My example for this was dependency scanning - dependency scanning should [continue using hash-based uniqueness based on the file path and package name](https://gitlab.com/gitlab-org/gitlab/-/blob/0d013cc934887497874f2f1dc2411015e8575084/ee/lib/gitlab/ci/reports/security/locations/dependency_scanning.rb#L21-23).

> Analyzers that depend on the existence of a string within project files to
> determine uniqueness could use these types.
>
> ```json
> "tracking": {
>   "type": "match",
>   "regex": "^\\s*[^#]package_name==1.2.3$",
>   "file": "path/to/file"
> }
> ```

**Hash-based Uniqueness**

Analyzers that wish uniqueness to be determined from a direct hash of an analyzer-provided value could use a `hash` uniqueness type:

```js
"tracking": {
  "type": "hash",
  "data": "DATA TO HASH" // maybe Object()s too?
}
```

## 5. Use Specific Identifiers Paired with Uniqueness Data to Identify Distinct Vulnerabilities

Having a uniqueness value declared by the analyzer helps us identify distinct vulnerabilities. However, it is not the complete story. If we want to be able to deduplicate vulnerabilities from separate analyzers, we will also need to include vulnerability identifiers.

This is roughly [how the code currently works](https://gitlab.com/gitlab-org/gitlab/-/blob/576fea813d60c00ce483738e18cb417a4e0c512e/ee/app/services/security/merge_reports_service.rb#L85-108), except that CWE and WASC identifiers are ignored. I do not believe this should be the case. If two vulnerability occurrences, each from different analyzers, resolve to the same location in the code but have vastly different CWE or WASC types, then we could use that information to conclude that they are not the same vulnerability.

A concept similar to this was used in the [sast-benchmark project](https://gitlab.com/gitlab-org/security-products/sast-benchmark).
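An identifier-aware comparison could be sketched as follows. This is hypothetical and much simpler than the linked `merge_reports_service.rb` code: two occurrences only collapse into one when both their uniqueness fingerprint and their CWE identifier agree, so same-location findings with different CWEs stay distinct.

```ruby
# Hypothetical identifier-aware deduplication: occurrences from different
# analyzers merge only when both the uniqueness fingerprint and CWE match.
Occurrence = Struct.new(:fingerprint, :cwe, :tool, keyword_init: true)

def deduplicate(occurrences)
  occurrences.uniq { |o| [o.fingerprint, o.cwe] }
end

findings = [
  Occurrence.new(fingerprint: 'abc123', cwe: 'CWE-89',  tool: 'analyzer_a'),
  Occurrence.new(fingerprint: 'abc123', cwe: 'CWE-89',  tool: 'analyzer_b'), # duplicate
  Occurrence.new(fingerprint: 'abc123', cwe: 'CWE-798', tool: 'analyzer_b'), # same spot, different CWE
]

puts deduplicate(findings).size  # 2
```

A real implementation would also need the CWE-hierarchy comparison described next, since different tools may report related but non-identical CWE identifiers.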
The CWE hierarchy was used so that the benchmark code could compare results from multiple SAST tools, even if the same CWE identifier was not reported. See the [`cluster_secondary_same?` and `cluster_primary_same?`](https://gitlab.com/gitlab-org/security-products/sast-benchmark/-/blob/de9031039509770940c7ffc2e9d9c6b4b023ac9f/lib/cwe_info.rb#L51-75) functions as well as [this documentation in the scoring implementation](https://gitlab.com/gitlab-org/security-products/sast-benchmark/-/blob/de9031039509770940c7ffc2e9d9c6b4b023ac9f/lib/scoring/gitlab.rb#L91-135).

## Benefits

Requiring analyzers to explicitly define their uniqueness intent has many benefits:

### 1. Type-Specific vs Analyzer-Specific Uniqueness Metrics

Different analyzers of the same type (for example, SAST) may choose different uniqueness intents (types). This gives us flexibility by allowing each analyzer to choose the uniqueness type that matches its intent.

### 2. Deduplication of Vulnerability Occurrences

Having the intent of the analyzer clearly identified will help us more accurately track distinct vulnerabilities in an evolving code base. We will be able to apply a more appropriate tracking or correlation strategy to each uniqueness type. Additionally, making use of CVE, CWE, and WASC identifiers may help us correlate vulnerability occurrences between different analyzers.

### 3. Vulnerability Tracking Improvements Occur Independently

The tracking of vulnerabilities with specific uniqueness types can be improved independently from analyzer development. With analyzers explicitly declaring their intent, the uniqueness format they emit should rarely change, while the backend can continually improve its handling of each uniqueness type.
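Concretely, a backend consuming the proposed `tracking` field might dispatch on `type` to select a fingerprinting strategy, and each strategy could then evolve on its own. This is a hypothetical sketch; the field names follow the proposed JSON above, and nothing here exists in the codebase:

```ruby
require 'digest'
require 'json'

# Hypothetical dispatcher: each proposed tracking type maps to its own
# fingerprinting strategy, so the analyzer's intent is declared rather
# than inferred from `location`.
def tracking_fingerprint(tracking)
  case tracking['type']
  when 'position'
    # Position-based MVC behaviour: still a hash of the literal values,
    # but the backend is now free to improve this independently.
    Digest::SHA1.hexdigest(tracking.values_at('file', 'line', 'column').join(':'))
  when 'hash'
    # Hash-based: the analyzer supplies the exact data to hash.
    Digest::SHA1.hexdigest(tracking['data'])
  else
    raise ArgumentError, "unknown tracking type: #{tracking['type']}"
  end
end

position = JSON.parse('{"type":"position","file":"path/to/file","line":10,"column":5}')
puts tracking_fingerprint(position)
```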
For example, the tracking of a vulnerability occurrence that has a position-based uniqueness value could be improved in an iterative manner:

* Use a hash of the file/line numbers/column (MVC - also what we're currently doing)
* Track the diffs between commits to see if the current vulnerability location points to the same data as an older vulnerability occurrence, ignoring any data insertions or deletions that occurred earlier in the file. In other words, track the modifications to the file that occurred in each commit between the two security reports to determine if the new location was simply a result of data being inserted or deleted on previous lines.
* Parse the source code into an AST and track the position in the AST of the data pointed to by the location. For example, a function call might be:
  * a statement in a while loop
  * in a function body
  * of a function named `test_function`
  * in a class named `TheClass`
  * in a file named `the_test_class.language`
* Perhaps this should become a new uniqueness type, `ast`?
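The diff-tracking step above could, under heavy simplifying assumptions (a single file, pure insertions/deletions, no change touching the vulnerable line itself), be sketched like this. The hunk representation here is invented for illustration:

```ruby
# Hypothetical sketch: shift an old vulnerability line number forward
# through a list of diff hunks so it can be compared with the location
# in a newer report. Each hunk is [line_where_change_starts, delta_in_lines];
# changes touching the vulnerable line itself are out of scope here.
def shifted_line(old_line, hunks)
  hunks.reduce(old_line) do |line, (start, delta)|
    start <= line ? line + delta : line
  end
end

# A 3-line comment block inserted at line 1, and one line deleted at line 90:
hunks = [[1, 3], [90, -1]]
puts shifted_line(42, hunks)  # 45: only the insertion above line 42 applies
```

If the shifted old line equals the line reported in the new scan, the two occurrences plausibly point at the same data and can be tracked as one vulnerability.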