Improve vulnerability recognition when its location has changed

Problem to solve

In order to provide reliable data and feature based on time related properties, we need to be able to identify the same occurrence between reporting tools executions. We currently identify a vulnerability via two coordinate fingerprints made from:

primary identifier
location (file path, start line, end line, etc.)

For some kinds of vulnerabilities the location may change as the code is updated. For instance, let's consider a line of code that is affected by a vulnerability reported by SAST. The location of the vulnerability is a file path and a line number. If some new lines are inserted at the beginning of affected file, the line number changes and the location fingerprint changes too. Because of that, the vulnerability is considered as a new one when in fact it's been seen before. This flaw makes it impossible to tell when a vulnerability was first introduced in the source code.

Further details

We need to track and update the locations of the vulnerabilities whenever the code is updated. This is necessary so that users know the context in which a vulnerability has been introduced:

date
commit SHA
author

Also, it makes possible to measure the time it took for the team to fix a vulnerability issue, and to build various metrics on top of this information.

Tracking the location of a vulnerability is also a step towards improving the vulnerability feedback. Right now a vulnerability feedback is lost as soon as the location changes, and users then have to create a new one. The vulnerability feedback wouldn't benefit from the location update right away though, because it currently relies on the project_fingerprint. First step is to replace the project_fingerprint with the location fingerprint (and primary identifier).

Proposal

Before trying to match existing vulnerabilities (in DB) with the new ones coming from the latest report, we need to convert their location in the context of the latest commit.

By leveraging the git diff, we can map file paths and line numbers to get an updated position for existing vulnerabilities. Then fingerprints comparison will get more matches and we'll avoid creating new vulnerabilities for existing occurrences.

Additionally, we should update the docs to remove the "Alpha" warning for Interacting with Vulnerabilities once this change has been merged: https://docs.gitlab.com/ee/user/application_security/#interacting-with-the-vulnerabilities

Example

Let's say branch head is at commit A and there is no stored vulnerabilities yet for that branch.

After a push, head is now at commit B. There is no existing vulnerabilities for that branch, so nothing to compare with. In other words, all reported vulnerabilities are Added ones, so for each of them we create a Vulnerabilities::Occurrence.

On next push, head is now at commit C.

For each existing vulnerability on that branch (those reported in the previous commit B), we try to map their location in the source code as of C revision by leveraging the git diff (B->C). This can be done thanks to the existing CompareService and LineMapper classes.

This mapping leads to two possible outcomes:

Occurrence location is not found anymore in the new revision
Occurrence location is found and we can provide updated values (file_path, start_line, end_line)

For each Vulnerabilities::Occurrence that is not found, we consider it as fixed. In other words, we were not able to find the same line of code (content) in the new revision of the source code, so the associated vulnerability is gone too. NB: this is an assumption that relies on the accuracy of the git diff.

For each Vulnerabilities::Occurrence that has an updated location, we now have up to date values so we can generate a new location_fingerprint.

Finally, we compare these updated Vulnerabilities::Occurrence with the ones reported on this new commit C (using primary identifier and location_fingerprint). This allows us to get the list of the truly Added vulnerabilities. For each of them, we create a Vulnerabilities::Occurrence.

At that point, all Vulnerabilities::Occurrence for that branch have a location on the latest revision C, or are flagged as fixed.

Here is a diagram describing this workflow when pushing the `C` commit:

This approach also makes it really easy to compare reports between arbitrary commits... which is what we're doing on the MR widget. So I think we cam probably provide a common logic to handle reports diff, not only when updating DB records but also on Merge Request. In other words, this would provide a more reliable MR widget with more accurate added/fixed/existing listings.

PoC

This issues requires a lot of work so before going further we need a PoC to validate the proposal.

See it in action here: https://www.youtube.com/watch?v=gD5f9oBm8Lk

What does success look like, and how can we measure that?

A vulnerability that have its location updated because of source code update is recognized and matched with it's previous location, ending up with a record update instead of inserting a new one and a delete the existing one.

Links / references

Edited Apr 13, 2020 by Lucas Charles