Add service to import package metadata into the DB

Problem to solve

Data imported from the external license db needs to be stored in the instance database. The amount of data imported could be quite large, and the database tables for packages, their versions, and licenses have uniqueness constraints. Conflicts on those constraints must be handled so that the ids of the created (or already existing) rows can be stored in their associated tables.

Since data sizes are quite large, a bulk insert is required.

Proposal

Create a service that can ingest package-version-license rows, batch them, and store them in the database, ensuring that conflicts are handled correctly.

This service will be invoked by PackageMetadata::SyncService with a [package, version, license] data row. The service should indicate successful saves to the caller so that the caller can in turn update the last sync position to which the data belonged.

The pm_package_versions and pm_package_version_licenses tables store the ids of the rows with which they are associated (e.g. pm_package_id). Because of this, a mapping of ids representing created packages and licenses needs to be maintained in order to populate the associated ids correctly.
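The id dependency described above can be illustrated with a minimal plain-Ruby sketch. It does not touch the database: FakeTable is a hypothetical in-memory stand-in for a table with a uniqueness constraint, and the sample rows are invented for illustration.

```ruby
# Hypothetical stand-in for a table with a uniqueness constraint.
# #upsert mimics "insert, or on conflict give back the existing row id".
class FakeTable
  def initialize
    @ids_by_key = {}
    @next_id = 0
  end

  def upsert(key)
    @ids_by_key[key] ||= (@next_id += 1)
  end

  def size
    @ids_by_key.size
  end
end

packages = FakeTable.new
versions = FakeTable.new

# Invented sample data rows: [package, version, license]
rows = [
  %w[rails 7.0.4 MIT],
  %w[rails 7.0.5 MIT],
  %w[rack 3.0.0 MIT]
]

# Mapping from package name to package id, kept so that version rows
# can be built with the correct pm_package_id.
package_ids = {}

version_rows = rows.map do |name, version, _license|
  package_ids[name] ||= packages.upsert(name)
  { pm_package_id: package_ids[name], version: version }
end

version_rows.each { |row| versions.upsert([row[:pm_package_id], row[:version]]) }
```

The version rows cannot be built until the package ids are known, which is why the packages stage has to run first and report its ids back.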

Implementation Plan

  • add PackageMetadata::Import::ImportService under ee/app/services/package_metadata/import
    • allow client code to #execute service with a batch of data
    • for each tuple call ingest on the appropriate task
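As a rough sketch of the shape this could take (plain Ruby, no Rails): ImportServiceSketch, its Result struct, and RecordingTask are hypothetical names used only for illustration. Only #execute and the per-tuple ingest call come from the plan above.

```ruby
# Hypothetical sketch of the service entry point.
class ImportServiceSketch
  # Lets the caller decide whether to advance its sync position.
  Result = Struct.new(:success, :ingested_count) do
    def success?
      success
    end
  end

  # tasks: objects responding to #ingest(tuple), called in the given order.
  def initialize(tasks)
    @tasks = tasks
  end

  def execute(batch)
    batch.each do |tuple|
      @tasks.each { |task| task.ingest(tuple) }
    end
    Result.new(true, batch.size)
  rescue StandardError
    # Assumption: any failure marks the whole batch as not saved.
    Result.new(false, 0)
  end
end

# Hypothetical task that only records what it was asked to ingest.
class RecordingTask
  attr_reader :seen

  def initialize
    @seen = []
  end

  def ingest(tuple)
    @seen << tuple
  end
end

task = RecordingTask.new
result = ImportServiceSketch.new([task]).execute([%w[rails 7.0.4 MIT]])
```

Returning a result object rather than raising keeps the decision about updating the last sync position with the caller.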

Use Gitlab::Ingestion::BulkInsertableTask

This approach is quite a bit cleaner than the alternatives, but has one caveat: an occurrence map of the ids encountered could take up a significant amount of memory. Care must be taken to keep the id mapping below a certain limit, perhaps by removing the least recently used values. Dropping entries is OK since insert ... on conflict ... returning id will always give back the correct data.
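One way to bound the mapping is a small least-recently-used cache. The following is a minimal sketch, not the planned implementation: BoundedIdMap and its max_size parameter are hypothetical, and it relies on Ruby hashes preserving insertion order.

```ruby
# Hypothetical size-bounded id map that evicts the least recently used key.
class BoundedIdMap
  def initialize(max_size)
    @max_size = max_size
    @ids = {}
  end

  # Returns the cached id, or nil on a miss.
  def [](key)
    return nil unless @ids.key?(key)

    # Delete and re-insert so the key becomes the most recently used.
    id = @ids.delete(key)
    @ids[key] = id
  end

  def []=(key, id)
    @ids.delete(key)
    @ids[key] = id
    # Hash#shift removes the oldest entry, i.e. the least recently used.
    @ids.shift while @ids.size > @max_size
  end

  def size
    @ids.size
  end
end

map = BoundedIdMap.new(2)
map['rails'] = 1
map['rack'] = 2
map['rails']     # touch 'rails' so that 'rack' is now the oldest entry
map['puma'] = 3  # exceeds the limit and evicts 'rack'
```

A miss on an evicted key is not an error under this scheme; the row would simply go through the conflict-handling insert again to recover its id.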

  • add PackageMetadata::Import::OccurrenceMap to store mappings of ingested attributes.
    • { $name: { id: $pm_packages.id, versions: { $version: $pm_package_versions.id }, $license: }
  • using sbom ingestion as a template, create tasks under package_metadata/import/tasks directory in ee/app/services
    • PackageMetadata::Import::Tasks::Packages exposing id and name as unique attributes
      • implement after_ingest to populate package.id back to PackageMetadata::Import::OccurrenceMap
    • PackageMetadata::Import::Tasks::PackageVersions exposing id, pm_package_id, and version as unique attributes
      • implement after_ingest to populate the package version id back to PackageMetadata::Import::OccurrenceMap
    • PackageMetadata::Import::Tasks::Licenses exposing id and spdx_identifier as unique attributes
      • implement after_ingest to populate the license id back to PackageMetadata::Import::OccurrenceMap
    • PackageMetadata::Import::Tasks::PackageVersionLicenses
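The interaction between the tasks and the occurrence map might look roughly like the sketch below. It is plain Ruby and does not use Gitlab::Ingestion::BulkInsertableTask: OccurrenceMapSketch and its method names are hypothetical, keeping licenses in a separate top-level hash is an assumption, and the join-row keys pm_package_version_id and pm_license_id are assumed column names. The arrays of returned values stand in for whatever each task's after_ingest receives.

```ruby
# Hypothetical occurrence map: package name => { id:, versions: { version => id } },
# plus a separate hash of license identifier => id (an assumption).
class OccurrenceMapSketch
  def initialize
    @packages = {}
    @licenses = {}
  end

  def record_package(name, id)
    entry(name)[:id] = id
  end

  def record_version(name, version, id)
    entry(name)[:versions][version] = id
  end

  def record_license(spdx_identifier, id)
    @licenses[spdx_identifier] = id
  end

  def package_id(name)
    @packages.dig(name, :id)
  end

  def version_id(name, version)
    @packages.dig(name, :versions, version)
  end

  def license_id(spdx_identifier)
    @licenses[spdx_identifier]
  end

  private

  def entry(name)
    @packages[name] ||= { id: nil, versions: {} }
  end
end

map = OccurrenceMapSketch.new

# Each line imitates one task's after_ingest step with invented ids.
[[10, 'rails']].each { |id, name| map.record_package(name, id) }
[[20, 'rails', '7.0.4']].each { |id, name, version| map.record_version(name, version, id) }
[[30, 'MIT']].each { |id, spdx| map.record_license(spdx, id) }

# The final task can then build its row purely from looked-up ids.
join_row = {
  pm_package_version_id: map.version_id('rails', '7.0.4'),
  pm_license_id: map.license_id('MIT')
}
```

The ordering matters: packages, then versions and licenses, then the join rows, since each later stage reads ids written by an earlier one.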

Note: this issue depends on Partition package metadata tables (#382567) which adds a column to 2 new tables:

  • pm_package_versions gets a purl_type column
  • pm_package_version_licenses gets a purl_type column

Testing

Shifting left, this could be covered by appropriate DB/unit tests. It is not a candidate for E2E testing in isolation; however, the overall new License DB would need to be tested.

Edited by Will Meek