Add package metadata ingestion for version format v2

What does this MR do and why?

This MR adds functionality to sync version_format v2 license data.

In this format, each line is a package JSON object with its license data "compressed" under a single attribute.

Two tables are touched during ingestion: pm_packages and pm_licenses.

  • License data is collected from the slice of objects passed to the ingestion service and upserted into pm_licenses.
  • A map of license spdx_identifiers to their ids is built so it can be used when preparing pm_packages data.
  • Package data is compressed by translating license names to their database ids and converting the JSON object under licenses to an array. This dataset is then upserted.
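
The three steps above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the attribute names (name, licenses), the shape of the licenses object (SPDX identifier mapped to a list of versions), and the helper names are all hypothetical, and the database upserts are simulated in memory rather than written to pm_licenses or pm_packages.

```ruby
require 'json'

# Hypothetical input: one JSON object per line. The attribute names and the
# shape of "licenses" are assumptions for illustration, not the real v2 schema.
LINES = <<~NDJSON
  {"name":"rails","licenses":{"MIT":["7.0.0","7.0.1"]}}
  {"name":"example","licenses":{"Apache-2.0":["1.0"],"MIT":["2.0"]}}
NDJSON

# Step 1: collect the distinct license identifiers from the slice of objects
# (stand-in for the upsert into pm_licenses).
def collect_licenses(objects)
  objects.flat_map { |obj| obj.fetch('licenses').keys }.uniq
end

# Step 2: build a map of spdx_identifier => id. Ids are simulated with a
# counter here; in a real service they would come back from the database.
def build_license_map(spdx_identifiers)
  spdx_identifiers.each_with_index.to_h { |spdx, index| [spdx, index + 1] }
end

# Step 3: replace license names with ids and turn the licenses object into
# an array of [license_id, versions] pairs.
def compress_packages(objects, license_map)
  objects.map do |obj|
    licenses = obj.fetch('licenses').map do |spdx, versions|
      [license_map.fetch(spdx), versions]
    end
    { 'name' => obj.fetch('name'), 'licenses' => licenses }
  end
end

objects = LINES.each_line.map { |line| JSON.parse(line) }
license_map = build_license_map(collect_licenses(objects))
packages = compress_packages(objects, license_map)

puts JSON.generate(license_map)
packages.each { |pkg| puts JSON.generate(pkg) }
```

Building the map once per slice means each package row only needs a hash lookup per license, rather than a database query per license name.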

How to set up and validate locally

Prepare dataset

Currently only this dataset is available: #409732 (comment 1386970564)

Because the data is not yet available in the v2 URL format, it needs to be downloaded, converted to have the correct path, and synced in offline mode (by writing the data to vendor/package_metadata_db/v2).

download.rb: download.rb

Run it via: ruby download.rb

Note: Move the downloaded data to the GitLab directory.

Run ingestion via rails runner

ingest.rb: ingest.rb

Run this via: bundle exec rails runner ingest.rb

Note: The PM_SYNC_INDEV environment flag controls whether sync runs in the development environment. It is false by default. Enable sync by running export PM_SYNC_INDEV=true before running ingest.rb.
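
As a quick sanity check before running ingestion, a guard along these lines could be placed at the top of a local script. This is a hypothetical sketch: the sync_enabled? helper is not part of this MR, and how GitLab itself parses the flag is an assumption here (a case-insensitive comparison against "true", defaulting to false).

```ruby
# Hypothetical helper: returns true only when PM_SYNC_INDEV is set to "true".
# The parsing rule is an assumption, not taken from the GitLab codebase.
def sync_enabled?(env = ENV)
  env.fetch('PM_SYNC_INDEV', 'false').to_s.strip.downcase == 'true'
end

# Warn early instead of silently running a script that syncs nothing.
unless sync_enabled?
  warn 'PM_SYNC_INDEV is not set to true; sync would be skipped in development.'
end
```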

Progress

Sync progress can be seen in log/application_json.log, where the sync URL is indicated.

Progress can also be observed via checkpoints: bundle exec rails runner 'puts PackageMetadata::Checkpoint.where(version_format: "v2").all.map(&:attributes)'

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #408901 (closed)

Edited by Igor Frenkel