Skip to content

Add data model and migration for vulnerabilities

Olivier Gonzalez requested to merge 7046_vulnerabilities_data_model into master

What does this MR do?

Implements Data model to store Security Vulnerabilities in the Database. @gl-database here are some details to explain the need and help your review.

Usages of the DB model:

  • fetch occurrences for project_id, ref, scanner_id: storage algorithm need to find similar occurrences.
    • uses a join with vulnerabilities_identifiers and vulnerabilities_occurrence_identifiers as we look for a particular primary identifier
  • delete occurrences for project_id, ref, and pipeline_id not in given list: storage algorithm need to cleanup fixed occurrences (NB: this might change to an update instead of delete in next iterations)
  • fetch occurrences for ref, [project_id, category, severity, confidence]: group level dashboard will list vulnerabilities with a lot of optional criteria
    • join with projects and namespaces to find projects from that particular group (and subgroups)
    • can also join with vulnerabilities_identifiers and vulnerabilities_occurrence_identifiers to filter by identifier in next iterations
  • fetch occurrences for project_id, ref, [category, severity, confidence]: project level dashboard will list vulnerabilities with a lot of optional criteria (in next iterations)

indexes:

  • scanner_id (fk)
  • pipeline_id (fk)
  • project_id (fk)
  • location_fingerprint, primary_identifier_fingerprint, project_id, scanner_id, ref (unique)

Unique composite index columns order is currenlty optimized for performance (based on estimated cardinality, see below) but that makes it less reusable for other queries. Some insights about cardinality:

  • project_id is a foreign key for projects. So up to # of enabled projects
  • ref may often be similar (we currently only store vulns for default_branch so == "master") but later it could also have other unique values (protected_branches). Say we limit monitoring to 5 branches max, it could be up to 5 X # of enabled projects
  • scanner_id is a foreign key for scanners which are scoped by namespace. There are currently ~15 different scanners, that number may hardly exceed 100. So up to 100 X # of namespaces
  • location_fingerprint is made from file_path, start_line, end_line, and maybe other properties later. This value is very likely to be unique. So ~ # of occurrences
  • primary_identifier_fingerprint is made from identifier.external_type and identifier.external_id. This value is less likely to be unique than location_fingerprint as it may be reused within the same project and across projects (e.g. every projects impacted by CVE-1234 will have a row with the same primary_identifier_fingerprint. So ~ # of occurrences

The feature is available for ultimate private projects and public projects on .com, these are the potential "enabled projects". There is at least one row per project_id in the table but it's likely that there are more so location_fingerprint and primary_identifier_fingerprint are more selective than project_id The ref might be less selective than project_id as I assume most of projects uses common name for default and protected branches (this could be eventually checked on .com DB) It's hard to tell about scanner_id... While in theory this column can have a high cardinality, projects are often limited to few languages and don't always run each category of reports so it's unlikely to have all scanners for every project.

Based on above explanations:

  • would it be a good idea to add an index on the ref column?
  • is the order of the columns in unique index the best choice?
  • any other feedback?

What are the relevant issue numbers?

#7046 (closed)

Does this MR meet the acceptance criteria?

Edited by Kamil Trzciński

Merge request reports