Skip to content

Remove duplicate package names for licenses

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Problem to solve

The Postgres database of License DB contains duplicate Go and Python packages, because package names were not normalized prior #418911 (closed). These duplicates need to be merged into the record that has the canonical name, or removed so that exports only contain relevant data.

This problem is limited to the pypi_components and go_components tables (which are reference by the pypi_license and `go_license tables, respectively).

User impact

This shouldn't impact users. See #418913 (comment 1500501795)

  • The License DB database and its export still contains duplicate packages. This is annoying but it shouldn't affected users.
  • License DB no longer generate new duplicates.
  • License DB no longer touches the updated_at of the exiting duplicates w/ non-normalized names.
  • During the sync, the last duplicate wins b/c the backend normalizes package names.
  • Then eventually, the backend gets correct data.

Links

Possible fixes

  1. Truncate component tables, and ingest everything from pypi and go registries again.
    • Pros: Accurate and simple.
    • Cons: It triggers a full export, and conversely a full import.
  2. Save updated_at, truncate tables, restore, export.
    • Pros: Accurate. Minimal export.
    • Cons: Complex.
  3. Delete components that have duplicates, including the ones that have normalized names. Wait for these components to be re-ingested and re-exported again.
    • Pros: Accurate. Fairly simple. Minimal export.
    • Cons: ?
  4. Merge duplicates into records that have normalized packages names, and update updated_at column of resulting records to export them again. #418913 (comment 1474555873)
    • Pros: Accurate. No ingestion needed. Minimal export.
    • Cons: Complex.

Proposal

  1. Delete components that have duplicates, including the ones that have normalized names. Wait for these components to be re-ingested and re-exported again.

Implementation plan

  • List duplicates using inner join on *_package table.
  • Store a list of these duplicates. We could export these from Postgres using COPY.
  • Delete the duplicates on dev, and check
  • Delete the duplicates on production.
Edited by 🤖 GitLab Bot 🤖