Normalize names of Python SBoM components on ingestion
Why are we doing this work
When ingesting SBoM components, we currently use the literal name received from the report. Many languages such as Python consider package names to be case-insensitive, so we need to normalize the component names when ingesting so that we do not end up with different records for the same component.
If we don't normalize the names, we might end up with duplicates, that is, multiple sbom_components records referring to a single component (served by a package registry).
It would then be impossible to reference a single sbom_components record from a vulnerability advisory using a foreign key.
In particular, names of Python packages should be normalized so that they can be compared as documented in PEP 426:
All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent.
See normalization of Python package names in Gemnasium: https://gitlab.com/gitlab-org/security-products/analyzers/gemnasium/-/blob/v3.9.6/advisory/repo.go#L217
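As a minimal sketch of the rule quoted above, the following Ruby helper lowercases a name and maps underscores to hyphens. The method name normalize_python_name is hypothetical, and this is not a copy of the Gemnasium implementation. Note that PEP 503 describes a broader rule (runs of `.`, `-`, and `_` collapse to a single `-`); whether to adopt that instead is a decision this sketch leaves open.

```ruby
# Hypothetical helper illustrating the PEP 426 comparison rule only:
# case-insensitive, with hyphens and underscores treated as equivalent.
def normalize_python_name(name)
  # String#downcase is Unicode-aware in Ruby 2.4 and later.
  name.downcase.tr('_', '-')
end

normalize_python_name('Pillow')            # => "pillow"
normalize_python_name('Typing_Extensions') # => "typing-extensions"
```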
Context: This was raised by @fcatteau in a previous discussion.
See comment
I'm assuming that we store the normalized component names in the database, like pillow for Python package Pillow. This way each advisory affects exactly one component, and we can have a vulnerability_advisories.component_id foreign key....
I'm not concerned seeding the DB with gemnasium-db b/c its YAML files should contain normalized names, in the package_slug. However, that's not the case for project SBOMs being ingested. It's important to normalized component names when ingesting the SBOMs, otherwise we'll end up with duplicates in the case of Python (multiple sbom_components referring to the same Python package). Then it would be impossible to set vulnerability_advisories.component_id to correct value.
Relevant links
Non-functional requirements
- Documentation:
- Feature flag:
- Performance:
- Testing:
Implementation plan
- In ee/app/services/sbom/ingestion/tasks/ingest_components.rb, normalize the component name prior to insertion. For example:

    diff --git a/ee/app/services/sbom/ingestion/tasks/ingest_components.rb b/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
    index ddc187bdadf..075ab43a5be 100644
    --- a/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
    +++ b/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
    @@ -24,7 +24,11 @@ def after_ingest
     
             def attributes
               occurrence_maps.map do |occurrence_map|
    -            occurrence_map.to_h.slice(*COMPONENT_ATTRIBUTES)
    +            component_attributes = occurrence_map.to_h.slice(*COMPONENT_ATTRIBUTES)
    +
    +            component_attributes[:name] = normalize_name(component_attributes[:name])
    +
    +            component_attributes
               end
             end
     

- Add model validation to ensure that names of Python packages only contain normalized characters.
- Add DB constraint to ensure that sbom_components.name only contains normalized chars when purl_type is pypi.
Unicode characters must be supported.
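The plan above could be sketched in plain Ruby as follows, outside of Rails. Everything here is illustrative: the attribute list in COMPONENT_ATTRIBUTES, the two-argument normalize_name, the normalized_name? predicate, and the string value 'pypi' for purl_type are all assumptions, and plain hashes stand in for occurrence maps. The sketch scopes normalization to pypi components and treats a name as valid when normalizing it changes nothing, which avoids maintaining a separate character whitelist and keeps Unicode names acceptable.

```ruby
# Stand-in for the real constant; the attribute list is an assumption.
COMPONENT_ATTRIBUTES = %i[name purl_type component_type].freeze

# Hypothetical helper: only pypi names are rewritten, others pass through.
def normalize_name(name, purl_type)
  return name unless purl_type == 'pypi'

  name.downcase.tr('_', '-')
end

# Hypothetical validation predicate: valid when normalization is a no-op.
def normalized_name?(name, purl_type)
  normalize_name(name, purl_type) == name
end

# Plain hashes stand in for occurrence maps in this sketch.
def component_attributes(occurrence_maps)
  occurrence_maps.map do |occurrence_map|
    attrs = occurrence_map.to_h.slice(*COMPONENT_ATTRIBUTES)
    attrs[:name] = normalize_name(attrs[:name], attrs[:purl_type])
    attrs
  end
end

rows = component_attributes([
  { name: 'Django', purl_type: 'pypi', component_type: 'library', version: '4.1' },
  { name: 'django', purl_type: 'pypi', component_type: 'library', version: '3.2' },
  { name: 'cookie-parser', purl_type: 'npm', component_type: 'library', version: '1.4.6' }
]).uniq

rows.each { |row| puts row.inspect }
```

After `uniq`, the two Django entries collapse into a single component row, while the npm entry is left exactly as reported.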
Verification steps
- Set up a project using the NodeJS/Express template, enable SBOM ingestion, and enable Dependency Scanning as documented in #364709 (closed).
- Add a requirements.txt file that contains Django. A new pipeline is triggered.
- List components for that pipeline as documented in #364709 (closed), and check the following:
  - It contains a record where name is cookie-parser and PURL type is npm.
  - It contains a record where name is django and PURL type is pypi.
  - It does NOT contain a record where name is Django and PURL type is pypi.

This proves that names of pypi components are normalized, but names of npm components are not.