Store PURL namespace of SBOM components
Release notes
Problem to solve
When ingesting project SBOMs, we store the name of each SBOM component in the database (column sbom_occurrences.name). However, that information isn't sufficient to perform License Scanning or Vulnerability Scanning; we also need the namespace extracted from the PURL.
See thread: !105994 (comment 1242236039)
We might normalize the PURL name and PURL namespace before storing them so that they can be compared directly to the package metadata tables used by License Scanning and Continuous Vulnerability Scanning. Alternatively, names and namespaces can be normalized in memory when performing the scans.
Further details
sbom_components.name stores the raw component name, which is defined in https://cyclonedx.org/docs/1.4/json/#components_items_name.
The name of the component. This will often be a shortened, single name of the component. Examples: commons-lang3 and jquery
The JSON snippet shared in https://cyclonedx.org/use-cases/#package-evaluation shows that the component name usually doesn't include the namespace. The namespace or group ID needs to be extracted from the purl field.
{
"type": "library",
"group": "org.apache.tomcat",
"name": "tomcat-catalina",
"version": "9.0.14",
"purl": "pkg:maven/org.apache.tomcat/tomcat-catalina@9.0.14"
}
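As a minimal sketch of what "extracting the namespace from the purl field" means for the component above: split_purl is a hypothetical helper written for illustration, not part of the GitLab codebase. It assumes a well-formed PURL and skips the percent-decoding and type-specific rules that a full PURL parser applies.

```ruby
# Hypothetical helper (illustration only): splits a PURL string into its
# type, namespace, name and version. Qualifiers and subpath are discarded,
# and no percent-decoding is performed.
def split_purl(purl)
  raise ArgumentError, 'not a PURL' unless purl.start_with?('pkg:')

  # Drop the scheme, then the optional subpath (#...) and qualifiers (?...).
  remainder = purl.delete_prefix('pkg:').sub(/#.*\z/, '').sub(/\?.*\z/, '')

  # The version follows the right-most "@", when there is one.
  path, at, version = remainder.rpartition('@')
  if at.empty?
    path = remainder
    version = nil
  end

  # The first segment is the type, the last one is the name, and anything
  # in between is the namespace (which may be absent).
  type, *namespace_segments, name = path.split('/')
  namespace = namespace_segments.empty? ? nil : namespace_segments.join('/')

  { type: type, namespace: namespace, name: name, version: version }
end

parts = split_purl('pkg:maven/org.apache.tomcat/tomcat-catalina@9.0.14')
puts parts[:namespace] # the Maven group ID
puts parts[:name]      # matches the CycloneDX "name" field
```

For PURL types that don't use a namespace, such as pkg:npm/lodash@4.17.21, the helper returns a nil namespace.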
Proposal
Store the normalized qualified PURL name in sbom_components.name. The qualified name combines the PURL namespace with the PURL name.
See previous proposals
Proposals:
- Combine the normalized PURL name and namespace, and store them in a single column of sbom_components, like purl_qualified_name.
  - Pro: It's ready to be used by License Scanning and Vulnerability Scanning.
  - Pro: The component name and the PURL name CAN diverge.
    - The Dependency List can present a component name that's different from the PURL name used for the scans. In particular, the reported component name can be the original name, whereas purl_qualified_name contains the normalized name.
    - We can accurately track components that don't have a PURL. (We have to drop the NOT NULL constraint on PURL type to achieve that.)
  - Con: It's very likely that sbom_components.name IS repeated.
- Store the normalized PURL name and namespace in separate columns of sbom_components, like purl_name and purl_namespace.
  - Pro: The component name and the PURL name CAN diverge.
  - Pro: We can efficiently search project dependencies by PURL name or by PURL namespace.
  - Pro: We can omit the PURL name when it repeats sbom_components.name, to save storage (efficient but not explicit).
- Store the normalized PURL namespace in a new column of sbom_components, like purl_namespace, and use sbom_components.name as the PURL name.
  - Pro: sbom_components.name IS NOT repeated, so we save storage.
  - Con: The component name and the PURL name CANNOT diverge.

We could also drop sbom_components.name, and only store the PURL name and namespace.
- Pro: The information stored in the DB is optimized for the scans.
- Con: We can't track SBOM components that don't have a PURL. (That's already the case because of the NOT NULL constraint on purl_type, but this constraint could be removed.)
- Con: Dropping a column is a multi-step migration.
(We should also consider the SQL queries we'll perform, and the DB indexes that will make these queries efficient.)
Implementation plan
Update the data stored in name when ingestion runs.
- Add Sbom::PackageUrl::Normalizer#purl_qualified_name, which returns a combination of the normalized namespace and the normalized name: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/sbom/package_url/normalizer.rb
- Update Sbom::Ingestion::OccurrenceMap to call the above method: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/sbom/ingestion/occurrence_map.rb#L41
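
A rough sketch of what purl_qualified_name could look like. QualifiedNameSketch is a hypothetical stand-in, not the actual Sbom::PackageUrl::Normalizer. The "/" separator and the list of lowercased types are assumptions made for illustration, and the real normalizer may apply more type-specific rules.

```ruby
# Hypothetical stand-in for Sbom::PackageUrl::Normalizer (illustration only).
class QualifiedNameSketch
  # Assumption: PURL types whose namespace and name are case-insensitive.
  LOWERCASE_TYPES = %w[pypi github bitbucket].freeze

  def initialize(type:, namespace:, name:)
    @type = type
    @namespace = namespace
    @name = name
  end

  # Joins the normalized namespace and the normalized name with "/"
  # (assumed separator). A missing namespace is skipped.
  def purl_qualified_name
    [normalize(@namespace), normalize_name].compact.join('/')
  end

  private

  # Returns nil for blank values so that compact can drop them.
  def normalize(value)
    return if value.nil? || value.empty?

    LOWERCASE_TYPES.include?(@type) ? value.downcase : value
  end

  def normalize_name
    normalized = normalize(@name)
    # The PURL spec treats "_" and "-" as equivalent in PyPI names.
    @type == 'pypi' ? normalized&.tr('_', '-') : normalized
  end
end

maven = QualifiedNameSketch.new(
  type: 'maven', namespace: 'org.apache.tomcat', name: 'tomcat-catalina'
)
puts maven.purl_qualified_name

pypi = QualifiedNameSketch.new(type: 'pypi', namespace: nil, name: 'Flask_Cors')
puts pypi.purl_qualified_name
```

With a method of this shape, Sbom::Ingestion::OccurrenceMap would only need to pass the parsed PURL parts through and store the returned string in name.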