Ingest version_format v2 for licenses

Why are we doing this work

Imported package metadata contains a large amount of duplication. To remove it, a new version_format (v2) is added, which compresses the license-version data for a package by eliminating duplicate entries.

Relevant links

Research spike: #407454 (closed)
Discussion of relevant data structure and algorithm: #407454 (comment 1357478354)

Non-functional requirements

Documentation: n/a
Feature flag: an FF is needed to select which version of ingestion is used on an instance (see implementation plan)
Performance: verify that the final size of the dataset is "significantly smaller" than with v1 (see research spike discussion about this)
Testing: n/a

Version Format v2

license-exporter will output a new version_format (v2).

URL

Old format: https://bucket/v1/purl_type/sequence/chunk.csv
New format: https://bucket/v2/licenses/purl_type/sequence/chunk.ndjson

Data Encoding

The data encoding for v2 is ndjson.

Data Structure

The data structure changes from a csv with one package-version-license row per line to one json object per package, storing all license data for that package.

To keep the stored structure as compact as possible, the json object contains the package name and the full set of licenses with their corresponding versions.

A conceptual example of the object (not strictly valid json, because each other_licenses key is itself a list of licenses):

{
  "package_name": "image_size",
  "licenses": { 
    "default_licenses": ["Ruby"],
    "highest_version": "3.2.0",
    "other_licenses": { ["MIT"]: ["1.1.2", "1.1.3", "1.1.4"] }
  }
}
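
As an illustration of how duplicated package-version-license rows could collapse into this structure, here is a minimal Ruby sketch. The sample rows, the compress helper, and the rule for choosing default_licenses (the license set shared by the most versions) are assumptions made for illustration only; the actual algorithm is the one discussed in the research spike linked above.

```ruby
require "rubygems" # provides Gem::Version for version comparison

# Hypothetical v1-style rows: [package_name, version, license]. Sample data only.
ROWS = [
  ["image_size", "1.0.0", "Ruby"],
  ["image_size", "1.1.2", "MIT"],
  ["image_size", "1.1.3", "MIT"],
  ["image_size", "1.1.4", "MIT"],
  ["image_size", "2.0.0", "Ruby"],
  ["image_size", "3.1.0", "Ruby"],
  ["image_size", "3.2.0", "Ruby"]
].freeze

# Collapse all rows of one package into the object form shown above.
# Assumption: the default is the license set that covers the most versions.
def compress(package_name, rows)
  # version => sorted, de-duplicated list of licenses for that version
  licenses_by_version = rows
    .group_by { |_name, version, _license| version }
    .transform_values { |group| group.map { |_n, _v, license| license }.uniq.sort }

  # license set => versions that use exactly that set
  versions_by_set = licenses_by_version
    .group_by { |_version, set| set }
    .transform_values { |pairs| pairs.map(&:first) }

  default_set = versions_by_set.max_by { |_set, versions| versions.size }.first
  highest = licenses_by_version.keys.max_by { |v| Gem::Version.new(v) }

  {
    "package_name" => package_name,
    "licenses" => {
      "default_licenses" => default_set,
      "highest_version" => highest,
      "other_licenses" => versions_by_set.reject { |set, _versions| set == default_set }
    }
  }
end

result = compress("image_size", ROWS)
```

With the sample rows, seven input rows reduce to one default license set plus a single exception entry for the three MIT versions.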

To decrease data size, the licenses value is actually serialized as a 3-element array in which each key of the object above corresponds to a position in the array (default_licenses, highest_version, other_licenses):

{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }
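
To make the positional layout concrete, the following Ruby sketch parses one such ndjson line and looks up the licenses for a version. The helper names (parse_line, licenses_for) are hypothetical, and the lookup rule (versions listed under other_licenses get that set, every other version gets the default) is an assumption inferred from the field names, not something this issue specifies.

```ruby
require "json"

# Positions in the compact array, taken from the example above.
DEFAULT_LICENSES = 0
HIGHEST_VERSION  = 1
OTHER_LICENSES   = 2

# Parse one ndjson line into [package_name, compact_licenses].
def parse_line(line)
  obj = JSON.parse(line)
  [obj.fetch("package_name"), obj.fetch("licenses")]
end

# Assumed lookup rule: a version listed under other_licenses gets that
# license set; any other version falls back to the default set.
def licenses_for(compact, version)
  match = compact[OTHER_LICENSES].find { |_licenses, versions| versions.include?(version) }
  match ? match.first : compact[DEFAULT_LICENSES]
end

line = '{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }'
name, compact = parse_line(line)

licenses_for(compact, "1.1.3") # listed under other_licenses
licenses_for(compact, "2.0.0") # not listed, so the default set applies
```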

Data Storage in Database

The data object will be stored in the licenses column of the pm_packages table added in

To compress this dataset further, the license names (spdx_identifiers) will be converted to their integer ids from the pm_licenses table:

# with Ruby=1 and MIT=2
package.licenses = { package_name: "image_size", licenses: [[1], "3.2.0", [[[2],["1.1.2", "1.1.3", "1.1.4"]]]] }
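
The name-to-id conversion can be sketched in plain Ruby as follows. The LICENSE_IDS hash is a hypothetical stand-in for a lookup against pm_licenses, and to_ids and compact_with_ids are illustrative helper names, not existing methods.

```ruby
# Hypothetical dictionary standing in for pm_licenses (spdx_identifier => id).
LICENSE_IDS = { "Ruby" => 1, "MIT" => 2 }.freeze

# Map a list of spdx identifiers to ids; raises KeyError on an unknown license.
def to_ids(names, dictionary)
  names.map { |name| dictionary.fetch(name) }
end

# Replace license names with ids in the 3-element compact array.
# Versions and highest_version are left untouched.
def compact_with_ids(compact, dictionary)
  default, highest, others = compact
  [
    to_ids(default, dictionary),
    highest,
    others.map { |names, versions| [to_ids(names, dictionary), versions] }
  ]
end

compact = [["Ruby"], "3.2.0", [[["MIT"], ["1.1.2", "1.1.3", "1.1.4"]]]]
with_ids = compact_with_ids(compact, LICENSE_IDS)
```

Using fetch rather than [] makes a license missing from the dictionary fail loudly instead of silently storing nil.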

Implementation plan

PackageMetadata::SyncConfiguration.all_by_enabled_purl_types updated to use version_format v2 when the package_metadata_license_version_format_v2 feature flag is set
PackageMetadata::SyncService updated to use version_format
- pass version_format to ingestion
- pass version_format to checkpoint
PackageMetadata::Connector::Gcp updated to use version_format
- form correct url based on version_format
  - when version_format=v2 the url changes to /v2/purl_type/licenses/sequence/chunk.ndjson
- extract CsvFile into its own DataFile class, which can parse both ndjson and csv
- update connector to instantiate correct data file based on version_format
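
A data file that handles both encodings could look roughly like the sketch below. It is a hypothetical stand-in, not the actual PackageMetadata class: the constructor signature, the "v1"/"v2" strings, and the line-by-line csv parsing (which would not handle quoted newlines) are all assumptions.

```ruby
require "csv"
require "json"
require "stringio"

# Hypothetical stand-in for the DataFile described in the plan above.
class DataFile
  include Enumerable

  def initialize(io, version_format)
    @io = io
    @version_format = version_format
  end

  # Yields one parsed record per non-empty line.
  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      line = line.strip
      next if line.empty?

      yield parse(line)
    end
  end

  private

  # v1 lines are csv rows; v2 lines are ndjson objects.
  def parse(line)
    case @version_format
    when "v1" then CSV.parse_line(line)
    when "v2" then JSON.parse(line)
    else raise ArgumentError, "unknown version_format: #{@version_format}"
    end
  end
end

v2_io = StringIO.new(%({"package_name":"image_size","licenses":[["Ruby"],"3.2.0",[]]}\n))
DataFile.new(v2_io, "v2").each { |record| puts record["package_name"] }
```

Keeping the format switch inside a single class means the connector only has to pass version_format through; callers iterate over records the same way for both versions.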
PackageMetadata::DataObject copied to create new class DataObjectV2 with the following attributes
- package
- licenses_names
- license_set
PackageMetadata::Ingestion::IngestionService split to add v2 service and tasks
- add IngestionServiceV2 with 2 tasks
  - Tasks::IngestLicenses (already exists)
  - Tasks::IngestPackagesV2
    - use bulk upsert with dictionary of licenses derived from first task

Verification steps

Instructions on how to run ingestion are documented here: #409732 (comment 1386970564)

After ingestion is complete all exported purl_types should be populated: e.g. PackageMetadata::Package.where(purl_type: x).count > 0.

Checkpoints for all exported purl_types in version_format v2 should be set: e.g. PackageMetadata::Checkpoint.where(purl_type: x).where(version_format: 'v2').count > 0.

Edited Jul 05, 2023 by Igor Frenkel