License DB exports duplicate packages (Python)

Summary

License DB exports duplicate entries that correspond to the same package spelled differently. For instance, it exports license information for the Python package PyPatchMatch under both PyPatchMatch and pypatchmatch.

It's been reported in the context of NDJSON v2 exports but it probably impacts CSV v1 exports as well.

Package names should be normalized during ingestion, similar to what's implemented in the backend. See https://gitlab.com/gitlab-org/gitlab/-/blob/0b3828cb01b9e21a5c9e9ee691ebad26ab323378/lib/sbom/package_url/normalizer.rb#L17
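As an illustration of what such a normalization could look like, here is a minimal sketch based on the PyPI name normalization rule from PEP 503 (lowercase, with runs of `-`, `_` and `.` collapsed to a single `-`). The function name `normalize_pypi_name` is hypothetical, and the backend normalizer linked above may differ in detail, so this should be read as an assumption rather than a description of the actual implementation.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Normalize a PyPI package name following PEP 503.

    Runs of '-', '_' and '.' are collapsed to a single '-',
    and the result is lowercased.
    """
    return re.sub(r"[-_.]+", "-", name).lower()


# Both spellings from this issue map to the same normalized name.
print(normalize_pypi_name("PyPatchMatch"))
print(normalize_pypi_name("pypatchmatch"))
```

With this rule, PyPatchMatch and pypatchmatch normalize to the same string, so they could be stored under a single component.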

Duplicates are now ignored by the backend, so this is now a minor bug with no significant impact on users. See Cardinality error on ingestion of v2 licenses (#415236 - closed).

Further details

The duplicates live in the database itself. For instance, right now the following SQL query returns licenses for PyPatchMatch and for pypatchmatch. See #415236 (comment 1429519156)

select pypi_component.name, pypi_license.license_ids, pypi_license."version"
from pypi_component
join pypi_license on pypi_license.component_id = pypi_component.id
where pypi_component.name = 'PyPatchMatch' or pypi_component.name = 'pypatchmatch'

What is the current bug behavior?

Exports contain duplicate entries that correspond to the same Python package, like one line for PyPatchMatch and another one for pypatchmatch.

What is the expected correct behavior?

There should be only one occurrence of a given Python package, like pypatchmatch (normalized name) OR PyPatchMatch (canonical name) but not both.

Relevant logs and/or screenshots

See #415236 (comment 1429505690)

grep -E 'pypatchmatch' -r ./
.//1686042289/000000005.ndjson:{"name":"pypatchmatch","lowest_version":"0.1.4","highest_version":"0.1.5","default_licenses":["unknown"]}
grep -E 'PyPatchMatch' -r ./
.//1686042289/000000005.ndjson:{"name":"PyPatchMatch","lowest_version":"0.1.6-a0","highest_version":"1.0.0","default_licenses":["unknown"]}
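To check an export for other packages affected in the same way, something like the following sketch could be used. It groups NDJSON records by normalized name and reports any name that appears under more than one spelling. The helper names are hypothetical, the normalization assumes the PEP 503 rule, and the only field relied on is `name` as seen in the lines above.

```python
import json
import re
from collections import defaultdict


def normalize_pypi_name(name):
    # Assumption: PEP 503 normalization (lowercase, collapse '-', '_', '.').
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicate_spellings(ndjson_lines):
    """Return {normalized_name: [spellings]} for names exported more than once."""
    spellings = defaultdict(set)
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        spellings[normalize_pypi_name(record["name"])].add(record["name"])
    # Keep only the normalized names that have several distinct spellings.
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}


sample = [
    '{"name":"pypatchmatch","lowest_version":"0.1.4","highest_version":"0.1.5","default_licenses":["unknown"]}',
    '{"name":"PyPatchMatch","lowest_version":"0.1.6-a0","highest_version":"1.0.0","default_licenses":["unknown"]}',
]
print(find_duplicate_spellings(sample))
```

For a real export, the lines of each `.ndjson` file could be passed in instead of `sample`.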

Possible fixes

Two options:

  • Normalize package names in the feeder and/or in the processor.
  • Check for duplicates using name normalization, and keep the canonical name.
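The second option could be sketched roughly as follows: names are compared through their normalized form, and one spelling is kept as the canonical name for each group. How the canonical spelling is chosen is not specified in this issue. The sketch assumes, purely for illustration, that a spelling that differs from the normalized form (for example one that preserves the original casing) is preferred. The function names are hypothetical and the normalization assumes the PEP 503 rule.

```python
import re


def normalize_pypi_name(name):
    # Assumption: PEP 503 normalization (lowercase, collapse '-', '_', '.').
    return re.sub(r"[-_.]+", "-", name).lower()


def dedupe_keep_canonical(names):
    """Map each normalized name to a single spelling kept as canonical."""
    canonical = {}
    for name in names:
        key = normalize_pypi_name(name)
        current = canonical.get(key)
        # Hypothetical rule: replace the stored spelling only when it is the
        # plain normalized form and the new spelling carries more information.
        if current is None or (current == key and name != key):
            canonical[key] = name
    return canonical


result = dedupe_keep_canonical(["pypatchmatch", "PyPatchMatch", "requests"])
print(result)
```

The first option needs no such rule, since only the normalized name is stored.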

It would be useful to have canonical names to implement Show canonical component names in Dependency List (#375715). However, it seems simpler to store normalized names. Names are normalized by the backend during ingestion.

Proposal

Normalize package names in the feeder and/or in the processor.

Implementation plan

TBD

/cc @nilieskou @philipcunningham

Edited by Fabien Catteau