Add data model and migration for vulnerabilities (!6896) · Merge requests · GitLab.org / GitLab

Olivier Gonzalez requested to merge 7046_vulnerabilities_data_model into master Aug 14, 2018

What does this MR do?

Implements Data model to store Security Vulnerabilities in the Database. @gl-database here are some details to explain the need and help your review.

Usages of the DB model:

fetch occurrences for project_id, ref, scanner_id: storage algorithm need to find similar occurrences.
- uses a join with vulnerabilities_identifiers and vulnerabilities_occurrence_identifiers as we look for a particular primary identifier
delete occurrences for project_id, ref, and pipeline_id not in given list: storage algorithm need to cleanup fixed occurrences (NB: this might change to an update instead of delete in next iterations)
fetch occurrences for ref, [project_id, category, severity, confidence]: group level dashboard will list vulnerabilities with a lot of optional criteria
- join with projects and namespaces to find projects from that particular group (and subgroups)
- can also join with vulnerabilities_identifiers and vulnerabilities_occurrence_identifiers to filter by identifier in next iterations
fetch occurrences for project_id, ref, [category, severity, confidence]: project level dashboard will list vulnerabilities with a lot of optional criteria (in next iterations)

indexes:

scanner_id (fk)
pipeline_id (fk)
project_id (fk)
location_fingerprint, primary_identifier_fingerprint, project_id, scanner_id, ref (unique)

Unique composite index columns order is currenlty optimized for performance (based on estimated cardinality, see below) but that makes it less reusable for other queries. Some insights about cardinality:

project_id is a foreign key for projects. So up to # of enabled projects
ref may often be similar (we currently only store vulns for default_branch so == "master") but later it could also have other unique values (protected_branches). Say we limit monitoring to 5 branches max, it could be up to 5 X # of enabled projects
scanner_id is a foreign key for scanners which are scoped by namespace. There are currently ~15 different scanners, that number may hardly exceed 100. So up to 100 X # of namespaces
location_fingerprint is made from file_path, start_line, end_line, and maybe other properties later. This value is very likely to be unique. So ~ # of occurrences
primary_identifier_fingerprint is made from identifier.external_type and identifier.external_id. This value is less likely to be unique than location_fingerprint as it may be reused within the same project and across projects (e.g. every projects impacted by CVE-1234 will have a row with the same primary_identifier_fingerprint. So ~ # of occurrences

The feature is available for ultimate private projects and public projects on .com, these are the potential "enabled projects". There is at least one row per project_id in the table but it's likely that there are more so location_fingerprint and primary_identifier_fingerprint are more selective than project_id The ref might be less selective than project_id as I assume most of projects uses common name for default and protected branches (this could be eventually checked on .com DB) It's hard to tell about scanner_id... While in theory this column can have a high cardinality, projects are often limited to few languages and don't always run each category of reports so it's unlikely to have all scanners for every project.

Based on above explanations:

would it be a good idea to add an index on the ref column?
is the order of the columns in unique index the best choice?
any other feedback?

What are the relevant issue numbers?

#7046 (closed)

Does this MR meet the acceptance criteria?

Changelog entry added, if necessary
Documentation created/updated
Tests added for this feature/bug
Conforms to the code review guidelines
Conforms to the merge request performance guidelines
Conforms to the style guides
Conforms to the database guides
EE specific content should be in the top level /ee folder
For a paid feature, have we considered GitLab.com plans, how it works for groups, and is there a design for promoting it to users who aren't on the correct plan?

Edited Sep 24, 2018 by Kamil Trzciński

Add data model and migration for vulnerabilities

What does this MR do?

What are the relevant issue numbers?

Does this MR meet the acceptance criteria?

Merge request reports