Normalize names of Python SBoM components on ingestion

Why are we doing this work

When ingesting SBoM components, we currently use the literal name received from the report. Many languages such as Python consider package names to be case-insensitive, so we need to normalize the component names when ingesting so that we do not end up with different records for the same component.

If we don't normalize the names, we might end up with duplicates, that is, multiple sbom_components records referring to a single component (served by a package registry). It would then be impossible to reference a single sbom_components record from a vulnerability advisory using a foreign key.

In particular, names of Python packages should be normalized so that they can be compared as documented in PEP 426:

All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent.

See normalization of Python package names in Gemnasium: https://gitlab.com/gitlab-org/security-products/analyzers/gemnasium/-/blob/v3.9.6/advisory/repo.go#L217

Context: This was raised by @fcatteau in a previous discussion.

See comment

I'm assuming that we store the normalized component names in the database, like pillow for Python package Pillow. This way each advisory affects exactly one component, and we can have a vulnerability_advisories.component_id foreign key.

...

I'm not concerned about seeding the DB with gemnasium-db b/c its YAML files should contain normalized names, in the package_slug.

However, that's not the case for project SBOMs being ingested. It's important to normalize component names when ingesting the SBOMs, otherwise we'll end up with duplicates in the case of Python (multiple sbom_components referring to the same Python package). Then it would be impossible to set vulnerability_advisories.component_id to the correct value.

Relevant links

https://gitlab.com/gitlab-org/security-products/analyzers/gemnasium/-/blob/v3.9.6/advisory/repo.go#L217

Non-functional requirements

Documentation:
Feature flag:
Performance:
Testing:

Implementation plan

In ee/app/services/sbom/ingestion/tasks/ingest_components.rb, normalize the component name prior to insertion. For example:

diff --git a/ee/app/services/sbom/ingestion/tasks/ingest_components.rb b/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
index ddc187bdadf..075ab43a5be 100644
--- a/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
+++ b/ee/app/services/sbom/ingestion/tasks/ingest_components.rb
@@ -24,7 +24,11 @@ def after_ingest

        def attributes
          occurrence_maps.map do |occurrence_map|
-            occurrence_map.to_h.slice(*COMPONENT_ATTRIBUTES)
+            component_attributes = occurrence_map.to_h.slice(*COMPONENT_ATTRIBUTES)
+
+            component_attributes[:name] = normalize_name(component_attributes[:name])
+
+            component_attributes
          end
        end
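
The diff above calls normalize_name without defining it. A minimal sketch of such a helper follows, based only on the PEP 426 rule quoted earlier (case-insensitive, hyphens and underscores equivalent). The module name PypiNameSketch and the purl_type argument are assumptions made for illustration: the diff passes only the name, and the real implementation may need a different way to find the PURL type.

```ruby
# Hypothetical helper, not part of the GitLab codebase.
module PypiNameSketch
  PYPI_PURL_TYPE = 'pypi'

  # Lowercases pypi names and replaces underscores with hyphens.
  # Names of any other PURL type (and nil names) are returned unchanged.
  def self.normalize_name(name, purl_type)
    return name unless name && purl_type.to_s == PYPI_PURL_TYPE

    name.downcase.tr('_', '-')
  end
end

puts PypiNameSketch.normalize_name('Django', 'pypi')           # django
puts PypiNameSketch.normalize_name('Flask_SQLAlchemy', 'pypi') # flask-sqlalchemy
puts PypiNameSketch.normalize_name('Cookie_Parser', 'npm')     # Cookie_Parser
```

Restricting the change to pypi keeps npm names untouched, which is what the verification steps below expect.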

Add model validation to ensure that names of Python packages only contain normalized characters.
Add DB constraint to ensure that sbom_components.name only contains normalized chars when purl_type is pypi.

Unicode characters must be supported.
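
The check behind the model validation could look roughly like the sketch below. It assumes that "normalized" means no uppercase characters and no underscores, which is one reading of the PEP 426 rule quoted earlier; the method name normalized_pypi_name? is hypothetical. Ruby's String#downcase applies Unicode case mapping by default (Ruby 2.4 and later), so non-ASCII names are covered.

```ruby
# Hypothetical predicate, not part of the GitLab codebase.
# Returns true when the name is already lowercase and contains no underscores.
def normalized_pypi_name?(name)
  name == name.downcase && !name.include?('_')
end

puts normalized_pypi_name?('pillow')           # true
puts normalized_pypi_name?('Pillow')           # false
puts normalized_pypi_name?('flask_sqlalchemy') # false
puts normalized_pypi_name?("\u00C9cole")       # false (uppercase E with acute accent)
puts normalized_pypi_name?("\u00E9cole")       # true
```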

Verification steps

Set up a project using the NodeJS/Express template, enable SBOM ingestion, and enable Dependency Scanning as documented in #364709 (closed).
Add a requirements.txt file that contains Django. A new pipeline is triggered.
List components for that pipeline as documented in #364709 (closed), and check the following:
- It contains a record whose name is cookie-parser and whose PURL type is npm.
- It contains a record whose name is django and whose PURL type is pypi.
- It does NOT contain a record whose name is Django and whose PURL type is pypi.

This proves that names of pypi components are normalized, but names of npm components are not.

Edited Nov 24, 2022 by Thiago Figueiró