Add data model and migration for vulnerabilities
What does this MR do?
Implements Data model to store Security Vulnerabilities in the Database. @gl-database here are some details to explain the need and help your review.
Usages of the DB model:
- fetch occurrences for project_id, ref, scanner_id: storage algorithm need to find similar occurrences.
- uses a join with
vulnerabilities_identifiers
andvulnerabilities_occurrence_identifiers
as we look for a particular primary identifier
- uses a join with
- delete occurrences for project_id, ref, and pipeline_id not in given list: storage algorithm need to cleanup fixed occurrences (NB: this might change to an update instead of delete in next iterations)
- fetch occurrences for ref, [project_id, category, severity, confidence]: group level dashboard will list vulnerabilities with a lot of optional criteria
- join with
projects
andnamespaces
to find projects from that particular group (and subgroups) - can also join with
vulnerabilities_identifiers
andvulnerabilities_occurrence_identifiers
to filter by identifier in next iterations
- join with
- fetch occurrences for project_id, ref, [category, severity, confidence]: project level dashboard will list vulnerabilities with a lot of optional criteria (in next iterations)
indexes:
- scanner_id (fk)
- pipeline_id (fk)
- project_id (fk)
- location_fingerprint, primary_identifier_fingerprint, project_id, scanner_id, ref (unique)
Unique composite index columns order is currenlty optimized for performance (based on estimated cardinality, see below) but that makes it less reusable for other queries. Some insights about cardinality:
- project_id is a foreign key for projects. So up to # of enabled projects
- ref may often be similar (we currently only store vulns for default_branch so == "master") but later it could also have other unique values (protected_branches). Say we limit monitoring to 5 branches max, it could be up to 5 X # of enabled projects
- scanner_id is a foreign key for scanners which are scoped by namespace. There are currently ~15 different scanners, that number may hardly exceed 100. So up to 100 X # of namespaces
- location_fingerprint is made from file_path, start_line, end_line, and maybe other properties later. This value is very likely to be unique. So ~ # of occurrences
- primary_identifier_fingerprint is made from identifier.external_type and identifier.external_id. This value is less likely to be unique than location_fingerprint as it may be reused within the same project and across projects (e.g. every projects impacted by
CVE-1234
will have a row with the same primary_identifier_fingerprint. So ~ # of occurrences
The feature is available for ultimate private projects and public projects on .com, these are the potential "enabled projects".
There is at least one row per project_id in the table but it's likely that there are more so location_fingerprint
and primary_identifier_fingerprint
are more selective than project_id
The ref
might be less selective than project_id
as I assume most of projects uses common name for default and protected branches (this could be eventually checked on .com DB)
It's hard to tell about scanner_id
... While in theory this column can have a high cardinality, projects are often limited to few languages and don't always run each category of reports so it's unlikely to have all scanners for every project.
Based on above explanations:
- would it be a good idea to add an index on the
ref
column? - is the order of the columns in unique index the best choice?
- any other feedback?
What are the relevant issue numbers?
Does this MR meet the acceptance criteria?
-
Changelog entry added, if necessary -
Documentation created/updated -
Tests added for this feature/bug -
Conforms to the code review guidelines -
Conforms to the merge request performance guidelines -
Conforms to the style guides -
Conforms to the database guides -
EE specific content should be in the top level /ee
folder -
For a paid feature, have we considered GitLab.com plans, how it works for groups, and is there a design for promoting it to users who aren't on the correct plan?