Add service to import package metadata into the DB
Problem to solve
Data imported from the external license db needs to be stored in the instance database. The amount of data imported could be quite large and the database tables for packages, their versions, and licenses have uniqueness constraints which require handling of conflicts in order to store the created row IDs in their associated tables.
Since data sizes are quite large, a bulk insert is required.
Proposal
Create a service that can ingest package-version-license rows, batch them, and store them in the database, ensuring that conflicts are handled correctly.
This service will be invoked by PackageMetadata::SyncService with a [package, version, license] data row. The service should indicate successful saves to the caller so that the caller can in turn update the last sync position to which the data belonged.
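A minimal sketch of the batching behaviour described above, in plain Ruby. The class name BatchingImporter, the batch_size parameter, and the store callable are hypothetical stand-ins, not the eventual service API; the store callable is assumed to persist one batch and raise on failure.

```ruby
# Hypothetical stand-in for the proposed import service; all names are illustrative.
class BatchingImporter
  DEFAULT_BATCH_SIZE = 2 # assumption: a real service would use a far larger batch

  # store: any callable that persists one batch (array of rows) and raises on failure.
  def initialize(store:, batch_size: DEFAULT_BATCH_SIZE)
    @store = store
    @batch_size = batch_size
  end

  # rows: array of [package, version, license] tuples.
  # Returns the number of rows saved so the caller can advance its sync position.
  def execute(rows)
    saved = 0
    rows.each_slice(@batch_size) do |batch|
      @store.call(batch)
      saved += batch.size
    end
    saved
  end
end

batches = []
importer = BatchingImporter.new(store: ->(batch) { batches << batch })
saved = importer.execute([
  %w[pkg-a 1.0.0 license-x],
  %w[pkg-a 1.1.0 license-x],
  %w[pkg-b 2.0.0 license-y]
])
```

Returning a count is only one way to signal success; a result object would work equally well if the caller needs more detail.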
The pm_package_versions and pm_package_version_licenses tables store the ids of the rows with which they are associated (e.g. pm_package_id). Because of this, a mapping of IDs representing created packages and licenses needs to be maintained in order to populate the associated ids correctly.
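To illustrate why the ID mapping is needed, the sketch below simulates the tables with in-memory hashes. FakeTable and its upsert method are hypothetical; upsert mimics a unique-constrained insert that hands back the existing row's id on conflict. Rows are processed one at a time here purely to show the id dependencies, not as a model of the bulk insert.

```ruby
# Hypothetical in-memory stand-in for a table with a uniqueness constraint.
class FakeTable
  def initialize
    @ids_by_key = {}
  end

  # Returns the id for the unique key, creating a row only if none exists yet.
  def upsert(key)
    @ids_by_key[key] ||= @ids_by_key.size + 1
  end

  def size
    @ids_by_key.size
  end
end

packages = FakeTable.new
versions = FakeTable.new
licenses = FakeTable.new
version_licenses = FakeTable.new

rows = [
  %w[pkg-a 1.0.0 license-x],
  %w[pkg-a 1.1.0 license-x],
  %w[pkg-a 1.1.0 license-y]
]

rows.each do |name, version, license|
  package_id = packages.upsert(name)                   # parent id must exist first
  version_id = versions.upsert([package_id, version])  # needs pm_package_id
  license_id = licenses.upsert(license)
  version_licenses.upsert([version_id, license_id])    # needs both ids
end
```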
Implementation Plan
- add PackageMetadata::Import::ImportService under ee/app/services/package_metadata/import
  - allow client code to #execute the service with a batch of data
  - for each tuple, call ingest on the appropriate task
Use Gitlab::Ingestion::BulkInsertableTask
This approach is quite a bit cleaner than the alternatives, but has one caveat: an occurrence map of the ids encountered could take up a significant amount of memory. Care must be taken to keep the id mapping below a certain limit, perhaps by removing the least recently used values. This is OK since insert ... on conflict ... returning id will always give back the correct data.
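One way to bound the id mapping, sketched under the assumption that least-recently-used eviction is acceptable. BoundedIdCache and max_size are hypothetical names; the block passed to fetch stands in for the database round-trip that recovers an id after it has been evicted.

```ruby
# Hypothetical LRU-bounded id cache; relies on Ruby hashes preserving insertion order.
class BoundedIdCache
  def initialize(max_size)
    @max_size = max_size
    @ids = {}
  end

  # Returns the cached id for key, or yields to obtain it (e.g. via an upsert).
  def fetch(key)
    if @ids.key?(key)
      id = @ids.delete(key) # re-inserted below to mark it as most recently used
    else
      id = yield(key)
      @ids.shift if @ids.size >= @max_size # evict the least recently used entry
    end
    @ids[key] = id
  end

  def cached?(key)
    @ids.key?(key)
  end

  def size
    @ids.size
  end
end

db = { "pkg-a" => 1, "pkg-b" => 2, "pkg-c" => 3 } # stand-in for the database
lookups = 0
cache = BoundedIdCache.new(2)
fetch_id = ->(name) { cache.fetch(name) { |key| lookups += 1; db.fetch(key) } }

fetch_id.call("pkg-a") # miss
fetch_id.call("pkg-b") # miss
fetch_id.call("pkg-a") # hit, pkg-a becomes most recently used
fetch_id.call("pkg-c") # miss, evicts pkg-b
```

An evicted key only costs one extra round-trip the next time it appears, which is the trade-off the paragraph above relies on.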
- add PackageMetadata::Import::OccurrenceMap to store mappings of ingested attributes.
  - { $name: { id: $pm_packages.id, versions: { $version: $pm_package_versions.id }, $license: }
- using sbom ingestion as a template, create tasks under package_metadata/import/tasks directory in ee/app/services
  - PackageMetadata::Import::Tasks::Packages exposing id and name as unique attributes
    - implement after_ingest to populate package.id back to PackageMetadata::Import::OccurrenceMap
  - PackageMetadata::Import::Tasks::PackageVersions exposing id, pm_package_id, and version as unique attributes
    - implement after_ingest to populate package.id back to PackageMetadata::Import::OccurrenceMap
  - PackageMetadata::Import::Tasks::Licenses exposing id and spdx_identifier as unique attributes
    - implement after_ingest to populate package.id back to PackageMetadata::Import::OccurrenceMap
  - PackageMetadata::Import::Tasks::PackageVersionLicenses
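The task layout above could look roughly like the following plain-Ruby sketch. It does not use the real Gitlab::Ingestion::BulkInsertableTask module; BaseTask, FakeStore, SimpleOccurrenceMap, and their methods are hypothetical stand-ins, and only the package and package version tasks are shown.

```ruby
# Hypothetical stand-ins; not the real Gitlab::Ingestion::BulkInsertableTask API.
class SimpleOccurrenceMap
  def initialize
    @packages = {} # name => { id:, versions: { version => id } }
  end

  def package_id(name)
    @packages.dig(name, :id)
  end

  def version_id(name, version)
    @packages.dig(name, :versions, version)
  end

  def set_package_id(name, id)
    entry(name)[:id] = id
  end

  def set_version_id(name, version, id)
    entry(name)[:versions][version] = id
  end

  private

  def entry(name)
    @packages[name] ||= { id: nil, versions: {} }
  end
end

# Simulates a bulk upsert that returns one id per unique key, in input order.
class FakeStore
  def initialize
    @ids = {}
  end

  def bulk_upsert(keys)
    keys.map { |key| @ids[key] ||= @ids.size + 1 }
  end
end

class BaseTask
  def initialize(store, map)
    @store = store
    @map = map
  end

  def execute(rows)
    records = rows.map { |row| build(row) }.uniq # dedupe before the bulk insert
    ids = @store.bulk_upsert(records.map { |record| record[:key] })
    after_ingest(records.zip(ids))
  end
end

class PackagesTask < BaseTask
  def build(row)
    { key: row[0], name: row[0] }
  end

  def after_ingest(records_with_ids)
    records_with_ids.each { |record, id| @map.set_package_id(record[:name], id) }
  end
end

class PackageVersionsTask < BaseTask
  def build(row)
    name, version = row
    # The unique key uses the package id populated by PackagesTask.
    { key: [@map.package_id(name), version], name: name, version: version }
  end

  def after_ingest(records_with_ids)
    records_with_ids.each do |record, id|
      @map.set_version_id(record[:name], record[:version], id)
    end
  end
end

map = SimpleOccurrenceMap.new
rows = [%w[pkg-a 1.0.0 license-x], %w[pkg-a 1.1.0 license-x], %w[pkg-b 2.0.0 license-y]]
PackagesTask.new(FakeStore.new, map).execute(rows)
PackageVersionsTask.new(FakeStore.new, map).execute(rows)
```

The tasks must run in dependency order: the version task reads package ids that the package task wrote to the map in its after_ingest hook.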
Note: this issue depends on Partition package metadata tables (#382567), which adds a column to 2 new tables:
- pm_package_versions gets a purl_type column
- pm_package_version_licenses gets a purl_type column
Testing
Shifting left, this could be tested by appropriate DB/unit tests. It is not a candidate for E2E testing in isolation; however, the overall new License DB would need to be tested.