Remove duplicate package names for licenses
Problem to solve
The Postgres database of License DB contains duplicate Go and Python packages, because package names were not normalized prior #418911 (closed). These duplicates need to be merged into the record that has the canonical name, or removed so that exports only contain relevant data.
This problem is limited to the pypi_components
and go_components
tables (which are reference by the pypi_license
and `go_license tables, respectively).
User impact
This shouldn't impact users. See #418913 (comment 1500501795)
- The License DB database and its export still contains duplicate packages. This is annoying but it shouldn't affected users.
- License DB no longer generate new duplicates.
- License DB no longer touches the
updated_at
of the exiting duplicates w/ non-normalized names. - During the sync, the last duplicate wins b/c the backend normalizes package names.
- Then eventually, the backend gets correct data.
✅
Links
- Normalize package names for Go and Python in Li... (#418911 - closed)
- [Feature flag] Rollout of `compressed_package_m... (#409793 - closed)
- #409793 (comment 1463711514)
- #415236 (comment 1429519156)
Possible fixes
- Truncate component tables, and ingest everything from pypi and go registries again.
- Pros: Accurate and simple.
- Cons: It triggers a full export, and conversely a full import.
- Save
updated_at
, truncate tables, restore, export.- Pros: Accurate. Minimal export.
- Cons: Complex.
- Delete components that have duplicates, including the ones that have normalized names. Wait for these components to be re-ingested and re-exported again.
- Pros: Accurate. Fairly simple. Minimal export.
- Cons: ?
- Merge duplicates into records that have normalized packages names, and update
updated_at
column of resulting records to export them again. #418913 (comment 1474555873)- Pros: Accurate. No ingestion needed. Minimal export.
- Cons: Complex.
Proposal
- Delete components that have duplicates, including the ones that have normalized names. Wait for these components to be re-ingested and re-exported again.
Implementation plan
-
List duplicates using inner join on *_package
table. -
Store a list of these duplicates. We could export these from Postgres using COPY
. -
Delete the duplicates on dev, and check -
Delete the duplicates on production.
Edited by Fabien Catteau