Specify how to identify SAST vulnerabilities
Problem to solve
To provide strong features on top of the vulnerabilities found by our tools we need to specify how to identify them in a reliable way.
For instance, users should be able to dismiss a particular occurrence or create an issue for it. To do so, we want vulnerability identifiers to be stable enough that what has been dismissed in one commit is also dismissed in subsequent commits, even if the code has changed. There is probably no way to make this perfect, which is why we're aiming for something "stable enough": identifiers should remain unchanged unless the code has changed significantly.
Further details
For SAST, the vulnerability classification is complex and made of multiple nested contexts:
- A broad category of the risk or the software weakness
  This is not yet used in Security Products but could be helpful to compare with competitors or for marketing or metrics. It's rarely provided by tools but we could probably automate this with a mapping table, for each tool.
  E.g. Injection:
  - OWASP Category: https://www.owasp.org/index.php/Top_10-2017_A1-Injection
- A type
  The vulnerability type is already reported by some SAST tools, but not all. The type is not specific to the language or the tool, so some well-known databases provide common identifiers to recognize them.
  E.g. SQL Injection
- A rule that defines one implementation potentially causing a type of vulnerability
  A rule is specific to the tool because the implementation of a type of vulnerability often depends on the language. The tool also provides the matching engine capable of finding occurrences of that particular rule. That's why the available identifiers for rules are often internal and specific to the tool itself. Sometimes there are also multiple implementations for the same type.
  E.g.
  - Find Sec Bugs Pattern: SQL_INJECTION
  - GO AST rule id: G201: SQL query construction using format string
- An occurrence of the implementation in the source code (a match for a specific rule)
  This is what is currently shown as a vulnerability (one line) in the SAST report. There is no easy way to find a stable identifier for an occurrence in the source code due to how often it changes.
  An occurrence can sometimes be assigned a common identifier like a CVE.
For the security reports we currently combine the upper level identifiers with some contextual data (file path, code extract, CVE, etc.) to generate a unique identifier. But again, it's not reliable.
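As a rough illustration of why such a combined identifier is fragile, here is a minimal sketch that hashes upper-level identifiers together with contextual data. The function name, the field names, and the choice of SHA-256 are assumptions made for this example; they are not the actual report format.

```python
import hashlib


def occurrence_fingerprint(tool, rule_id, file_path, code_extract, cve=""):
    """Hash upper-level identifiers plus contextual data into one identifier.

    All parameter names are hypothetical; they mirror the kinds of data
    listed above (file path, code extract, CVE).
    """
    parts = [tool, rule_id, file_path, code_extract.strip(), cve]
    # A NUL separator is unlikely to appear in the fields, which avoids
    # accidental collisions such as ("ab", "c") versus ("a", "bc").
    payload = "\x00".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


before = occurrence_fingerprint(
    "find_sec_bugs", "SQL_INJECTION", "app/src/main/java/App.java",
    'stmt.execute("SELECT * FROM users WHERE id = " + id);',
)
# Only a variable was renamed in the flagged line, yet the identifier changes.
after = occurrence_fingerprint(
    "find_sec_bugs", "SQL_INJECTION", "app/src/main/java/App.java",
    'stmt.execute("SELECT * FROM users WHERE id = " + userId);',
)
print(before == after)  # False
```

Any change to the code extract, however small, yields a new identifier, so a dismissal recorded against the old one would be lost.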
The boundaries between these levels are not crystal clear, though, and they vary from tool to tool. This makes it even harder to define a pattern that matches every case.
In order to provide reliable security features we need to answer the following questions:
- How to identify a specific occurrence across multiple executions (at project level)?
- How to identify a vulnerability at group/instance level, across multiple projects?
Proposal
How to identify a specific occurrence across multiple executions?
This is really important because it lets GitLab recognize the same vulnerability across different commits and branches. That is needed to produce a correct diff and to build reliable features such as feedback, dashboards, and eventually signal-to-noise improvements.
An occurrence lives in multiple nested contexts:
- a GitLab instance
- a group
- a project
- a file (path)
- a position in that file (class, method, line)
Unless we want to always aggregate all occurrences coming from the same file, it's necessary to be able to distinguish them down to that lowest level, providing a unique identifier per occurrence. But while it is quite easy and stable down to the file level, going down to the position within the same file is really hard and requires complex solutions.
- One may want to use the line number to achieve that, but then introducing a new line in the file will generate a different identifier and break the matching.
- Using context lines (a few lines before and a few lines after) may help reduce this issue, but modifying the code or introducing a new line within that context will still generate a different identifier.
- A more stable way would be to count the occurrences in the file and then assign an index, e.g. "occurrence #2 of 4". That way it would stay stable until one occurrence is added or removed. But if the positions of two occurrences are switched (code is refactored and moved), or if a code change adds one occurrence and removes another at the same time, then our matching will be wrong.
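The three strategies above can be compared with a small sketch. Everything here is illustrative: the function names, the sample Java lines, and the use of a truncated SHA-256 digest are assumptions, not an existing implementation.

```python
import hashlib


def _digest(*parts):
    # Short hex digest of the joined parts; SHA-256 is an arbitrary choice.
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()[:12]


def by_line_number(path, rule_id, line_no):
    return _digest(path, rule_id, str(line_no))


def by_context(path, rule_id, lines, line_no, radius=1):
    # line_no is 1-based; hash the match plus `radius` lines on each side.
    start = max(0, line_no - 1 - radius)
    window = [line.strip() for line in lines[start:line_no + radius]]
    return _digest(path, rule_id, *window)


def by_index(path, rule_id, match_lines, line_no):
    # 1-based rank of this match among all matches of the rule in the file.
    return _digest(path, rule_id, str(sorted(match_lines).index(line_no) + 1))


PATH, RULE = "App.java", "SQL_INJECTION"
v1 = [
    'String id = request.getParameter("id");',
    'stmt.execute("SELECT * FROM users WHERE id = " + id);',  # match, line 2
    "return render(result);",
]
v2 = ["// load the user"] + v1  # a line is inserted above: match moves to line 3
v3 = v2[:3] + ["return renderJson(result);"]  # the line after the match is edited

print(by_line_number(PATH, RULE, 2) == by_line_number(PATH, RULE, 3))  # False
print(by_context(PATH, RULE, v1, 2) == by_context(PATH, RULE, v2, 3))  # True
print(by_context(PATH, RULE, v2, 3) == by_context(PATH, RULE, v3, 3))  # False
print(by_index(PATH, RULE, [2], 2) == by_index(PATH, RULE, [3], 3))    # True
```

The index-based identifier survives both edits here, but it would change as soon as another match of the same rule is added or removed earlier in the file.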
Also, while less frequent, renaming a file will break the matching if the file name is part of the identifier. We probably could rely on git to catch this and update the identifier but this may be worth a dedicated issue for next iterations.
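One possible way to rely on git for this, sketched under assumptions: `git diff --name-status -M` reports renames as tab-separated `R<score>`, old path, new path. The helper below only parses such output; its name and the sample text are hypothetical, and how the result would be used to update stored identifiers is left open.

```python
def renames_from_name_status(output):
    """Map old paths to new paths from `git diff --name-status -M` output."""
    renames = {}
    for line in output.splitlines():
        fields = line.split("\t")
        # Rename entries look like "R<score>\t<old path>\t<new path>".
        if len(fields) == 3 and fields[0].startswith("R"):
            renames[fields[1]] = fields[2]
    return renames


# Hand-written sample in the format git prints; not taken from a real repository.
sample = (
    "M\tREADME.md\n"
    "R100\tapp/src/main/java/App.java\tapp/src/main/java/Main.java\n"
)
print(renames_from_name_status(sample))
```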
How to identify a vulnerability at group/instance level, across multiple projects?
A rule that finds similar matches on two different projects may return:
- same type identifier
- same rule identifier
- same or different file path
- same or different line, class, method name within the file
Knowing that, it makes sense to rely on type and rule identifiers to consider vulnerabilities as similar across different projects.
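A minimal sketch of that grouping, assuming each occurrence is available as a dictionary with `type`, `rule`, and `project` keys (hypothetical field names). The project names other than LoveJava are made up for the example.

```python
from collections import defaultdict


def group_similar(occurrences):
    """Group occurrences across projects by their (type, rule) identifiers."""
    groups = defaultdict(list)
    for occ in occurrences:
        # File path and line are deliberately ignored: they differ per project.
        groups[(occ["type"], occ["rule"])].append(occ["project"])
    return dict(groups)


occurrences = [
    {"type": "SQL Injection", "rule": "SQL_INJECTION", "project": "LoveJava"},
    {"type": "SQL Injection", "rule": "SQL_INJECTION", "project": "AnotherJavaApp"},
    {"type": "SQL Injection", "rule": "G201", "project": "SomeGoService"},
]
for key, projects in group_similar(occurrences).items():
    print(key, projects)
```

Grouping on the type alone would merge the Java and Go findings; grouping on both keeps them apart. Which granularity is wanted at group/instance level is a product decision this sketch does not settle.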
Summary
One particular "occurrence" listed in the SAST report has two coordinates:
- category, type, rule
- instance, project, file, location in that file
And we could even consider the commit and the ref where the occurrence has been found.
We need to find how to name these two coordinates but let's say an occurrence has a type and a location. The identifier of one particular occurrence is made of its exact location and its exact type, thus its exact coordinates. For instance, it would be:
- type: Find Sec Bugs' rule SQL_INJECTION, which is a "SQL Injection" type of vulnerability.
- location: on line 29 of file app/src/main/java/App.java of project LoveJava
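The two coordinates could be modeled as follows. This is only a sketch: the class and field names are placeholders (the naming is still open), and the instance host name is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnType:
    category: str
    type: str
    rule: str


@dataclass(frozen=True)
class Location:
    instance: str
    project: str
    file: str
    position: str  # a line number today; ideally a more robust fingerprint


@dataclass(frozen=True)
class Occurrence:
    vuln_type: VulnType
    location: Location


occurrence = Occurrence(
    vuln_type=VulnType(
        category="Injection", type="SQL Injection", rule="Find Sec Bugs: SQL_INJECTION"
    ),
    location=Location(
        instance="gitlab.example.com",  # hypothetical instance
        project="LoveJava",
        file="app/src/main/java/App.java",
        position="line 29",
    ),
)
print(occurrence)
```

Frozen dataclasses compare by value and are hashable, so two occurrences with the same coordinates are equal and can be used as dictionary keys, which is what an identifier needs.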
By the way, users may want to dismiss multiple occurrences using one of the two coordinates or a combination of the two. Here are a few examples:
- all occurrences of a certain category (more generic) or a certain rule (more specific)
- all occurrences found in one particular file (like some sample code)
- a combination of the above
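These bulk dismissals can be sketched as partial matches on the coordinates. The dictionary keys, the `matches` helper, and the sample values below are assumptions for illustration only.

```python
def matches(dismissal, occurrence):
    """True if every field set in the dismissal equals the occurrence's value."""
    return all(occurrence.get(field) == value for field, value in dismissal.items())


occurrences = [
    {"category": "Injection", "rule": "SQL_INJECTION", "file": "app/src/main/java/App.java"},
    {"category": "Injection", "rule": "SQL_INJECTION", "file": "samples/Demo.java"},
    {"category": "Injection", "rule": "COMMAND_INJECTION", "file": "samples/Demo.java"},
]

by_rule = {"rule": "SQL_INJECTION"}                            # a certain rule
by_file = {"file": "samples/Demo.java"}                        # one particular file
combined = {"rule": "SQL_INJECTION", "file": "samples/Demo.java"}  # both at once

for dismissal in (by_rule, by_file, combined):
    print(dismissal, sum(matches(dismissal, occ) for occ in occurrences))
```

The fewer fields a dismissal sets, the more occurrences it covers; an empty dismissal would match everything, so a real implementation would likely reject it.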
So we need all these nested contexts and we could easily collect them all. What's difficult is to describe the location of an occurrence in such a way that it doesn't change when the source file doesn't change much. In other words, we're looking for a robust fingerprint of the position.
What does success look like, and how can we measure that?
Clear specification on how to identify vulnerabilities.
Links / references
Other tools, such as Code Quality analysis, probably encounter the same issues and may already have found solutions. We should look at how they handle this.