gitlab_schema for package metadata used by Vulnerability Scanning, License Scanning
Problem to solve
See #377955 (closed)
~"group::composition analysis" needs to store metadata on packages hosted on public package registries (like rubygems.org or npmjs.com), in particular:
- licenses of package versions, to implement ~"Category:License Compliance" (License Scanning)
- security advisories affecting packages, to implement ~"Category:Container Scanning" and ~"Category:Dependency Scanning" (Vulnerability Scanning)
This package metadata could be stored using sbom_components and sbom_component_versions, two DB tables that were introduced to track project dependencies, and into which the backend upserts records when ingesting project SBOMs.
However,
- We'll have package metadata for packages/components that are not referenced as project dependencies (not mentioned in any SBOM).
- Conversely, we might have components listed in SBOMs even though we don't have package metadata for them (not listed on any package registry we track).
- SBOM components might not use the canonical package names.
- SBOM components might not have all the details needed to resolve to a unique package for which we have metadata. For instance, we might not have the PURL type, or we might not have package metadata for that PURL type.
- SBOM ingestion might result in missing package metadata (like missing license data), but that wouldn't be explicit. Tracking missing metadata would pollute the tables used for SBOM ingestion, and increase the overall complexity.
- Beyond that, tracking project dependencies listed in SBOMs and tracking package metadata from public registries are two different domains.
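As an illustration of the resolution problems listed above, here is a minimal sketch of mapping an SBOM component to a package metadata key. Everything in it is an assumption made for the example: `TRACKED_PURL_TYPES`, `canonical_name`, and `resolve_package_key` are hypothetical names, and only the pypi normalization rule from the PURL specification is shown.

```python
from typing import Optional, Tuple

# Hypothetical set of PURL types for which package metadata is tracked.
TRACKED_PURL_TYPES = {"gem", "npm", "pypi"}


def canonical_name(purl_type: str, name: str) -> str:
    """Return the canonical package name for a PURL type.

    Only pypi is handled here: the PURL spec lowercases pypi names and
    replaces underscores with dashes. Other types are returned unchanged.
    """
    if purl_type == "pypi":
        return name.lower().replace("_", "-")
    return name


def resolve_package_key(
    purl_type: Optional[str], name: str, version: Optional[str]
) -> Optional[Tuple[str, str, str]]:
    """Return (purl_type, name, version), or None if the component
    cannot be resolved to a unique tracked package version."""
    if purl_type is None or purl_type not in TRACKED_PURL_TYPES:
        return None  # missing PURL type, or a registry we do not track
    if not version:
        return None  # without a version we cannot pick a package version
    return (purl_type, canonical_name(purl_type, name), version)
```

A component that comes back as `None` still exists on the SBOM side; it simply has no counterpart in the package metadata tables.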
Proposal
- Introduce a new gitlab_schema for the DB tables that store package metadata, and track packages, versions, their licenses, their security advisories, etc.
- Use the PURL type, name, and version to join SBOM components with package metadata. These columns might even be used in SELECT JOIN queries. There are no Foreign Keys b/w package metadata tables and SBOM ingestion tables.
- Also, repeat software licenses in the new gitlab_schema, to track licenses of package versions. The existing software_licenses table in gitlab_main is used to store license policies. For efficiency, we might JOIN the two tables on the SPDX index, when provided.
- Store license data provided by users in that new gitlab_schema as well.
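The join on PURL type, name, and version could look roughly like the sketch below. It uses an in-memory SQLite database only so that the example runs anywhere. The pm_package_versions table and all column names are assumptions, and the SBOM side is collapsed into a single simplified sbom_components table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- SBOM side (simplified; column names are assumptions)
    CREATE TABLE sbom_components (
        id INTEGER PRIMARY KEY,
        purl_type TEXT,              -- may be NULL when the SBOM omits it
        name TEXT NOT NULL,
        version TEXT
    );
    -- Package metadata side (hypothetical table; no foreign key to SBOM tables)
    CREATE TABLE pm_package_versions (
        purl_type TEXT NOT NULL,
        name TEXT NOT NULL,
        version TEXT NOT NULL,
        spdx_identifier TEXT,
        PRIMARY KEY (purl_type, name, version)
    );
""")
conn.executemany(
    "INSERT INTO sbom_components (purl_type, name, version) VALUES (?, ?, ?)",
    [("gem", "rails", "7.0.4"), ("npm", "left-pad", "1.3.0"), (None, "mystery", "0.1")],
)
conn.executemany(
    "INSERT INTO pm_package_versions VALUES (?, ?, ?, ?)",
    [("gem", "rails", "7.0.4", "MIT"), ("npm", "lodash", "4.17.21", "MIT")],
)

# LEFT JOIN keeps every SBOM component; a NULL license means that no
# package metadata matched the (purl_type, name, version) triple.
rows = conn.execute("""
    SELECT c.name, c.version, v.spdx_identifier
    FROM sbom_components AS c
    LEFT JOIN pm_package_versions AS v
      ON v.purl_type = c.purl_type
     AND v.name = c.name
     AND v.version = c.version
    ORDER BY c.id
""").fetchall()
```

Because the two sides are linked only by value, either side can hold rows that the other lacks: here lodash has metadata but no SBOM component, while left-pad and the component without a PURL type have no metadata.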
Also, use list table partitioning on the PURL type in the new gitlab_schema. See #377955 (comment 1139104434)
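To make the partitioning idea concrete, the sketch below generates PostgreSQL declarative partitioning statements with one list partition per PURL type. The parent table name, its columns, and the use of a text purl_type column are assumptions for illustration; `list_partition_ddl` is a hypothetical helper, not part of GitLab.

```python
from typing import List


def list_partition_ddl(parent: str, purl_types: List[str]) -> List[str]:
    """Return PostgreSQL statements for a table list-partitioned by PURL type."""
    statements = [
        f"CREATE TABLE {parent} (\n"
        "    purl_type text NOT NULL,\n"
        "    name text NOT NULL,\n"
        # The primary key of a partitioned table must include the partition key.
        "    PRIMARY KEY (purl_type, name)\n"
        ") PARTITION BY LIST (purl_type);"
    ]
    for purl_type in purl_types:
        # Accept only simple identifiers, since the value is interpolated into SQL.
        if not purl_type.isalnum():
            raise ValueError(f"unexpected PURL type: {purl_type!r}")
        statements.append(
            f"CREATE TABLE {parent}_{purl_type} PARTITION OF {parent} "
            f"FOR VALUES IN ('{purl_type}');"
        )
    return statements


ddl = list_partition_ddl("pm_packages", ["gem", "npm"])
```

With this layout, a bulk import for one registry only touches the partition for that PURL type.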
TBD: name of the gitlab_schema
Pros:
- Good separation of concerns
- Flexibility. The two sets of tables can be optimized for the feature (SBOM ingestion, or License & Vulnerability Scanning). In particular, they can have DB constraints that are mutually exclusive.
- No conflicts. Large imports from the License DB don't lock tables used by SBOM ingestion.
- It makes it possible to move package metadata to a separate DB. See #377955 (comment 1139104434)
Cons:
- It takes more space. We can expect the contents of sbom_components and sbom_component_versions to be repeated in their counterparts in the new gitlab_schema. That said, this doesn't represent much compared to all the package metadata we'll store.
Implementation plan
- update schemas_to_base_models to add gitlab_pm
- add new pm_ tables under the gitlab_pm alias in https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/database/gitlab_schemas.yml
Links
https://docs.gitlab.com/ee/development/database/multiple_databases.html
Edited by Igor Frenkel