Store PURL namespace of SBOM components

Release notes

Problem to solve

When ingesting project SBOMs, we store the name of each SBOM component in the database (column sbom_components.name). However, that information isn't sufficient to perform License Scanning or Vulnerability Scanning; we also need the namespace extracted from the PURL.

See thread: !105994 (comment 1242236039)

We might normalize the PURL name and PURL namespace before storing them so that they can be directly compared to the package metadata tables used by License Scanning and Continuous Vulnerability Scanning. Alternatively, names and namespaces can be normalized in memory when performing the scans.
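To make the normalization step concrete, here is a minimal sketch of normalizing a name in memory before comparing it to a package metadata entry. The method names `normalize_name` and `same_package?` are hypothetical and are not part of the GitLab codebase. Only the pypi rule (lowercase, underscores replaced with dashes) is taken from the PURL specification; every other type is passed through unchanged, which is a simplifying assumption.

```ruby
# Hypothetical helper: applies a type-specific normalization rule to a name.
# Only the pypi rule comes from the PURL specification; other types are
# returned unchanged here (a simplifying assumption for this sketch).
def normalize_name(type, name)
  case type
  when "pypi" then name.downcase.tr("_", "-")
  else name
  end
end

# Hypothetical helper: compares an SBOM component name with a package
# metadata name after normalizing both in memory.
def same_package?(type, sbom_name, metadata_name)
  normalize_name(type, sbom_name) == normalize_name(type, metadata_name)
end

puts normalize_name("pypi", "Flask_SQLAlchemy")
puts same_package?("pypi", "Flask_SQLAlchemy", "flask-sqlalchemy")
```

If the normalized values are stored at ingestion time, the comparison becomes a plain equality check in SQL. If they are not, a helper like this has to run on every scan.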

Further details

sbom_components.name stores the raw component name, which is defined in https://cyclonedx.org/docs/1.4/json/#components_items_name.

The name of the component. This will often be a shortened, single name of the component. Examples: commons-lang3 and jquery

The JSON snippet shared in https://cyclonedx.org/use-cases/#package-evaluation shows that the component name usually doesn't include the namespace. The namespace or group ID needs to be extracted from the purl field.

    {
      "type": "library",
      "group": "org.apache.tomcat",
      "name": "tomcat-catalina",
      "version": "9.0.14",
      "purl": "pkg:maven/org.apache.tomcat/tomcat-catalina@9.0.14"
    }
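As an illustration of extracting the namespace from the purl field, the following sketch splits a PURL into its parts. `split_purl` is a hypothetical helper written for this example, not the parser used by GitLab. It ignores qualifiers and subpaths, does not percent-decode segments, and assumes the version contains no `/`.

```ruby
# Hypothetical, minimal PURL splitter for illustration only.
# Limitations: drops qualifiers ("?...") and subpaths ("#..."), does not
# percent-decode, and assumes the version contains no "/".
def split_purl(purl)
  scheme, rest = purl.split(":", 2)
  raise ArgumentError, "not a PURL: #{purl}" unless scheme == "pkg" && rest

  rest = rest.sub(/[?#].*\z/, "")               # drop qualifiers and subpath
  segments = rest.sub(%r{\A/+}, "").split("/")  # type / namespace... / name@version
  type = segments.shift
  last = segments.pop
  raise ArgumentError, "missing type or name: #{purl}" if type.nil? || last.nil?

  name, version = last.split("@", 2)
  # Whatever sits between the type and the name is the namespace.
  namespace = segments.empty? ? nil : segments.join("/")

  { type: type.downcase, namespace: namespace, name: name, version: version }
end

parts = split_purl("pkg:maven/org.apache.tomcat/tomcat-catalina@9.0.14")
puts parts[:namespace]
puts parts[:name]
```

For the Maven component above, the namespace matches the `group` field, while a PURL such as `pkg:npm/jquery@3.6.0` has no namespace at all.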

Proposal

Store the normalized qualified PURL name in sbom_components.name. The qualified name combines the PURL namespace with the PURL name.

See previous proposals

Proposals:

1. Combine the normalized PURL name and namespace, and store them in a single column of sbom_components, like purl_qualified_name.
   - Pro: It's ready to be used by License Scanning and Vulnerability Scanning.
   - Pro: The component name and the PURL name CAN diverge.
     - The Dependency List can present a component name that's different from the PURL name used for the scans. In particular, the reported component name can be the original name, whereas purl_qualified_name contains the normalized name.
     - We can accurately track components that don't have a PURL. (We have to drop the NOT NULL constraint on PURL type to achieve that.)
   - Con: It's very likely that sbom_components.name IS repeated in the new column.
2. Store the normalized PURL name and namespace in separate columns of sbom_components, like purl_name and purl_namespace.
   - Pro: The component name and the PURL name CAN diverge.
   - Pro: We can efficiently search project dependencies by PURL name or by PURL namespace.
   - Pro: We can omit the PURL name when it repeats sbom_components.name, to save storage (efficient but not explicit).
3. Store the normalized PURL namespace in a new column of sbom_components, like purl_namespace, and use sbom_components.name as the PURL name.
   - Pro: sbom_components.name IS NOT repeated, so we save storage.
   - Con: The component name and the PURL name CANNOT diverge.

We could also drop sbom_components.name, and only store the PURL name and namespace.

- Pro: The information stored in the DB is optimized for the scans.
- Con: We can't track SBOM components that don't have a PURL. (That's already the case because of the NOT NULL constraint on purl_type, but this constraint could be removed.)
- Con: Dropping a column is a multi-step migration.

(We should also consider the SQL queries we'll perform, and the DB indexes that will make these queries efficient.)

Implementation plan

Update the data stored in name when ingestion runs.

- Add Sbom::PackageUrl::Normalizer#purl_qualified_name, which returns a combination of the normalized namespace and the normalized name: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/sbom/package_url/normalizer.rb
- Update Sbom::Ingestion::OccurrenceMap to call the above method: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/sbom/ingestion/occurrence_map.rb#L41
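
A rough sketch of what the qualified-name method could return is shown below. This is a standalone, hypothetical function, not the actual Sbom::PackageUrl::Normalizer code. It assumes the namespace and name passed in are already normalized, and it assumes `/` as the separator between them, which this issue does not specify.

```ruby
# Hypothetical sketch; not the actual Sbom::PackageUrl::Normalizer method.
# Assumptions: both arguments are already normalized, and "/" separates
# the namespace from the name (the separator is not specified in this issue).
def purl_qualified_name(namespace, name)
  # Components without a namespace keep their plain name.
  return name if namespace.nil? || namespace.empty?

  "#{namespace}/#{name}"
end

puts purl_qualified_name("org.apache.tomcat", "tomcat-catalina")
puts purl_qualified_name(nil, "jquery")
```

Returning the bare name when there is no namespace keeps the stored value unchanged for ecosystems where components have none.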

Intended users

Feature Usage Metrics

Edited Jul 25, 2023 by Igor Frenkel