gitlab_schema for package metadata used by Vulnerability Scanning, License Scanning

Problem to solve

See #377955 (closed)

~"group::composition analysis" needs to store metadata on packages hosted on public package registries (like rubygems.org or npmjs.com), in particular:

  • licenses of package versions, to implement ~"Category:License Compliance" (License Scanning)
  • security advisories affecting packages, to implement ~"Category:Container Scanning" and ~"Category:Dependency Scanning" (Vulnerability Scanning)

This package metadata could be stored in sbom_components and sbom_component_versions, two DB tables that were introduced to track project dependencies, and into which the backend upserts records when ingesting project SBOMs.

However,

  • We'll have package metadata for packages/components that are not referenced as project dependencies (not mentioned in any SBOM).
  • Conversely, we might have components listed in SBOMs even though we don't have package metadata for them (not listed on any package registry we track).
  • SBOM components might not use the canonical package names.
  • SBOM components might not have all the details needed to resolve to a unique package for which we have metadata. For instance, the component might not have a PURL type, or we might not have package metadata for that PURL type.
  • SBOM ingestion might result in missing package metadata (like missing license data), but that wouldn't be explicit. Tracking missing metadata would pollute the tables used for SBOM ingestion, and increase the overall complexity.
  • Beyond that, tracking project dependencies listed in SBOMs and tracking package metadata from public registries are two different domains.

Proposal

  • Introduce a new gitlab_schema for the DB tables that store package metadata, and track packages, versions, their licenses, their security advisories, etc.
  • Use the PURL type, name, and version to join SBOM components with package metadata. These columns might even be used in SELECT ... JOIN queries. There are no foreign keys between the package metadata tables and the SBOM ingestion tables.
  • Also, duplicate software licenses in the new gitlab_schema, to track licenses of package versions. The existing software_licenses table in gitlab_main is used to store license policies. For efficiency, we might JOIN the two tables on the SPDX identifier, when provided.
  • Store license data provided by users in that new gitlab_schema as well.
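As a rough illustration of the join described above, here is a minimal sketch using an in-memory SQLite database. The table pm_package_version_licenses, the simplified column layout of the two sbom_* tables, and all sample rows are assumptions made up for this example; they are not the actual schema. The LEFT JOIN keeps SBOM components that have no matching package metadata, so the gap shows up as an explicit NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical stand-ins for the SBOM ingestion tables.
    CREATE TABLE sbom_components (
        id INTEGER PRIMARY KEY, purl_type TEXT, name TEXT NOT NULL);
    CREATE TABLE sbom_component_versions (
        id INTEGER PRIMARY KEY, component_id INTEGER NOT NULL,
        version TEXT NOT NULL);
    -- Hypothetical package metadata table. Note: no foreign key to the
    -- SBOM tables; rows are matched on (purl_type, name, version) only.
    CREATE TABLE pm_package_version_licenses (
        purl_type TEXT NOT NULL, name TEXT NOT NULL,
        version TEXT NOT NULL, spdx_identifier TEXT NOT NULL);
""")

# Made-up sample data. The third component has no PURL type.
conn.executemany("INSERT INTO sbom_components VALUES (?, ?, ?)", [
    (1, "npm", "pkg-a"), (2, "gem", "pkg-b"), (3, None, "internal-lib")])
conn.executemany("INSERT INTO sbom_component_versions VALUES (?, ?, ?)", [
    (1, 1, "1.0.0"), (2, 2, "2.1.0"), (3, 3, "0.3.0")])
# Metadata also covers a package that no SBOM references (pkg-c).
conn.executemany("INSERT INTO pm_package_version_licenses VALUES (?, ?, ?, ?)", [
    ("npm", "pkg-a", "1.0.0", "MIT"),
    ("gem", "pkg-b", "2.1.0", "Apache-2.0"),
    ("npm", "pkg-c", "3.0.0", "MIT")])

rows = conn.execute("""
    SELECT c.name, v.version, l.spdx_identifier
    FROM sbom_component_versions v
    JOIN sbom_components c ON c.id = v.component_id
    LEFT JOIN pm_package_version_licenses l
      ON l.purl_type = c.purl_type
     AND l.name = c.name
     AND l.version = v.version
    ORDER BY c.id
""").fetchall()

for name, version, license_id in rows:
    print(name, version, license_id or "(no package metadata)")
```

The component without a PURL type never matches, because comparing against NULL does not satisfy the join condition; this mirrors the case where an SBOM component cannot be resolved to a unique package.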

Also, use list table partitioning on the PURL type in the new gitlab_schema. See #377955 (comment 1139104434)
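To make the partitioning idea concrete, the sketch below generates PostgreSQL DDL for a table that is list-partitioned on the PURL type, with one partition per type. The table name pm_packages, its columns, and the use of a text column for the PURL type are assumptions for illustration only; the actual schema and naming are still to be decided.

```python
# Hypothetical subset of PURL types, matching the registries mentioned above
# (rubygems.org -> "gem", npmjs.com -> "npm").
PURL_TYPES = ["gem", "npm"]


def partition_ddl(parent: str, purl_types: list[str]) -> list[str]:
    """Return DDL for a list-partitioned parent table and its partitions."""
    statements = [
        f"CREATE TABLE {parent} (\n"
        "    purl_type text NOT NULL,\n"
        "    name text NOT NULL\n"
        ") PARTITION BY LIST (purl_type);"
    ]
    for purl_type in purl_types:
        # One partition per PURL type, holding only rows of that type.
        statements.append(
            f"CREATE TABLE {parent}_{purl_type} PARTITION OF {parent} "
            f"FOR VALUES IN ('{purl_type}');"
        )
    return statements


for statement in partition_ddl("pm_packages", PURL_TYPES):
    print(statement)
```

With this layout, a bulk import for one package registry would touch only the partition of the corresponding PURL type.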

TBD: name of the gitlab_schema

Pros:

  • Good separation of concerns
  • Flexibility. The two sets of tables can be optimized for the feature (SBOM ingestion, or License & Vulnerability Scanning). In particular, they can have DB constraints that are mutually exclusive.
  • No conflicts. Large imports from the License DB don't lock tables used by SBOM ingestion.
  • It makes it possible to move package metadata to a separate DB. See #377955 (comment 1139104434)

Cons:

  • It takes more space. We can expect the contents of sbom_components and sbom_component_versions to be repeated in their counterparts in the new gitlab_schema. That said, this is small compared to all the package metadata we'll store.

Implementation plan

Links

https://docs.gitlab.com/ee/development/database/multiple_databases.html

Edited by Igor Frenkel