Revisit Data model for vulnerabilities

Problem to solve

The current data model used to store vulnerabilities in the database doesn't cover the domain needs.

Further details

The current approach consists of storing every pipeline's report in the database with minimal tracking of previously existing data. This gives good INSERT performance but doesn't cover our needs, as stated above. It also makes READ queries more complex, because we always have to figure out which pipeline is the latest in order to get the most up-to-date state.

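As an illustration of that read path, here is a rough sketch of the kind of query we need today; the association and column names (`pipelines.ref`, `:occurrence_pipelines`) are assumptions and may not match the actual schema.

```ruby
# Find the latest pipeline for the branch first...
latest_pipeline = project.pipelines.where(ref: 'master').order(id: :desc).first

occurrences =
  if latest_pipeline
    # ...then every read has to go through the per-pipeline link table.
    Vulnerabilities::Occurrence
      .joins(:occurrence_pipelines)
      .where(occurrence_pipelines: { pipeline_id: latest_pipeline.id })
  else
    Vulnerabilities::Occurrence.none
  end
```
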
The domain reality is that the lifecycle of a vulnerability is not really tied to pipelines. A pipeline is just a means to compute the vulnerability state of a given project (a given branch) at a given point in time. By storing relations between vulnerabilities and every pipeline, we make the model more complex without much gain for the domain needs.

What we really care about is knowing that a vulnerability was introduced at a specific point in time and space, and being able to track it over time, between reports (commits). There is no need to link it to all the intermediary pipelines where it was reported.

This implies reviewing all existing records and comparing them with the newly reported vulnerabilities. The current model assumes this can be done by simply comparing columns (fingerprints), but it actually requires complex computations (tracking line and file changes between commits).

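A sketch of that naive fingerprint comparison, to make the limitation concrete; the `project_fingerprint` attribute and `report_occurrences` collection are assumptions used for illustration.

```ruby
existing = Vulnerabilities::Occurrence
  .where(project_id: project.id)
  .index_by(&:project_fingerprint)

report_occurrences.each do |reported|
  if existing.key?(reported.project_fingerprint)
    # Matched by fingerprint: considered the same vulnerability.
  else
    # No match: treated as new, even when it is the same vulnerability whose
    # fingerprint changed because its file or lines moved between commits.
    # Handling that correctly requires tracking file/line changes, which a
    # plain column comparison cannot do.
  end
end
```
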
By moving to a truly stateful model, we could solve these issues.

Proposal

Get rid of the Vulnerabilities::OccurrencePipeline model.

Have one Vulnerabilities::Occurrence record per (monitored) branch that we always keep up to date between pipelines. This allows us to:

  • track each Occurrence over time, by updating existing records to their new location in the latest version of the source code when necessary, or by flagging them as fixed when they are gone (see the sketch after this list).
  • provide a meaningful history, i.e. how many vulnerabilities were added and fixed per day, using timestamps (created_at, fixed_at).

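A minimal sketch of that stateful update between two pipelines, assuming hypothetical helpers (`report.find_matching`, `report.new_occurrences`) and columns (`ref`, `location`, `fixed_at`) that may not exist under these names today.

```ruby
def ingest_report(project, branch, report)
  existing = Vulnerabilities::Occurrence.where(project_id: project.id, ref: branch)

  existing.find_each do |occurrence|
    reported = report.find_matching(occurrence)

    if reported
      # Still present: move the record to its new location in the latest
      # version of the source code and make sure it is not flagged as fixed.
      occurrence.update!(location: reported.location, fixed_at: nil)
    else
      # Gone from the report: flag it as fixed rather than deleting it,
      # so the history is preserved.
      occurrence.update!(fixed_at: Time.current)
    end
  end

  # Occurrences seen for the first time become new records; created_at gives
  # us the "introduced at" timestamp.
  report.new_occurrences.each do |reported|
    Vulnerabilities::Occurrence.create!(
      project_id: project.id,
      ref: branch,
      location: reported.location
    )
  end
end
```
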
This will also provide the following benefits:

  • make it easy and efficient to get the current state of a given project (branch) or of a whole group, with no need to join with pipelines (see the read sketch below).
  • store fewer records, making it possible to cover every branch, which would streamline our codebase and give us a single interface. EDIT: this is no longer a goal, and even if we could store all branches this wouldn't provide the benefits we initially expected.
  • simplify the model

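With a single up-to-date record per branch, reads become plain filters and the per-day history mentioned above falls out of the timestamps. The `ref` and `fixed_at` columns below are assumptions used for illustration.

```ruby
# Current state of a project's branch: no pipeline join needed.
open_occurrences = Vulnerabilities::Occurrence
  .where(project_id: project.id, ref: 'master', fixed_at: nil)

# Vulnerabilities added per day.
added_per_day = Vulnerabilities::Occurrence
  .where(project_id: project.id, ref: 'master')
  .group('DATE(created_at)')
  .count

# Vulnerabilities fixed per day.
fixed_per_day = Vulnerabilities::Occurrence
  .where(project_id: project.id, ref: 'master')
  .where.not(fixed_at: nil)
  .group('DATE(fixed_at)')
  .count
```
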
What does success look like, and how can we measure that?

The data model suits the domain needs. There is no direct measure we can take, but this work will unlock user-facing features.

What is the type of buyer?

GitLab Ultimate

Links / references
