Ingest version_format v2 for licenses
Why are we doing this work
Imported package metadata contains a large amount of duplication. To remove it, a new version_format is added which compresses the license-version data for a package by removing duplicates.
Relevant links
- Research spike: #407454 (closed)
- Discussion of relevant data structure and algorithm: #407454 (comment 1357478354)
Non-functional requirements
- Documentation: n/a
- Feature flag: an FF is needed to select which version of ingestion is used on an instance (see implementation plan)
- Performance: assess that the final size of the dataset is "significantly smaller" (see research spike discussion about this)
- Testing: n/a
Version Format v2
license-exporter will output a new version_format (v2).
URL
- Old format: https://bucket/v1/purl_type/sequence/chunk.csv
- New format: https://bucket/v2/licenses/purl_type/sequence/chunk.csv
Data Encoding
The data encoding for v2 is ndjson (newline-delimited JSON, one JSON object per line).
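As a minimal sketch of what reading such a chunk could look like, the snippet below parses ndjson one line at a time. The helper name `parse_ndjson` is hypothetical and not part of the planned implementation.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: parse newline-delimited JSON from any IO-like object.
# Each non-empty line is an independent JSON document.
def parse_ndjson(io)
  io.each_line
    .map(&:strip)
    .reject(&:empty?)
    .map { |line| JSON.parse(line) }
end

# Illustrative input only; a real chunk would come from the bucket.
chunk = StringIO.new(<<~NDJSON)
  {"package_name": "image_size", "licenses": [["Ruby"], "3.2.0", []]}

  {"package_name": "image_size", "licenses": [["Ruby"], "3.2.0", []]}
NDJSON

packages = parse_ndjson(chunk)
puts packages.first["package_name"]
```

Because every line stands alone, a chunk can be streamed and ingested incrementally instead of being loaded as one large JSON document.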
Data Structure
The data structure changes from a CSV with one package-version-license row per line to a single object that stores all license data for a given package.
To keep the stored data structure as compact as possible, the JSON object holds the package name and the full set of licenses with their corresponding versions.
An example json object:
{
  "package_name": "image_size",
  "licenses": {
    "default_licenses": ["Ruby"],
    "highest_version": "3.2.0",
    "other_licenses": [[["MIT"], ["1.1.2", "1.1.3", "1.1.4"]]]
  }
}
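The deduplication step can be sketched roughly as follows, starting from v1-style (version, license) rows for one package. This is an illustration under stated assumptions, not the exporter's actual algorithm: the method name `compress_package` is hypothetical, the default license set is assumed to be the one covering the most versions, versions are assumed to be comparable with `Gem::Version`, and `other_licenses` is emitted as a list of [licenses, versions] pairs because a JSON object key cannot be an array.

```ruby
# Hypothetical sketch: collapse (version, license) rows into one object.
def compress_package(package_name, rows)
  # Collect every license declared for each version.
  licenses_by_version = Hash.new { |hash, version| hash[version] = [] }
  rows.each { |version, license| licenses_by_version[version] << license }

  # Invert the mapping: each distinct license set lists its versions once.
  versions_by_licenses = Hash.new { |hash, licenses| hash[licenses] = [] }
  licenses_by_version.each do |version, licenses|
    versions_by_licenses[licenses.uniq.sort] << version
  end

  # Assumption: the default is the license set shared by the most versions.
  default_licenses, = versions_by_licenses.max_by { |_licenses, versions| versions.size }

  # Assumption: versions follow a scheme that Gem::Version can order.
  highest_version = licenses_by_version.keys.max_by { |v| Gem::Version.new(v) }

  # Everything that is not the default is listed explicitly.
  other_licenses = versions_by_licenses
    .reject { |licenses, _versions| licenses == default_licenses }
    .to_a

  {
    "package_name" => package_name,
    "licenses" => {
      "default_licenses" => default_licenses,
      "highest_version" => highest_version,
      "other_licenses" => other_licenses
    }
  }
end
```

Versions covered by the default license set are never written out, which is where the size reduction comes from.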
To decrease data size further, the licenses value is in practice represented as a 3-element array in which each key is replaced by a fixed position (default_licenses, highest_version, other_licenses):
{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }
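To illustrate how the positional form might be read back, here is a small lookup sketch. The method name `licenses_for` is hypothetical, and the rule that any version not listed explicitly falls back to the default licenses is an assumption. The role of highest_version in lookups is not specified here, so the sketch ignores it.

```ruby
require 'json'

# Hypothetical lookup over the compact 3-element "licenses" array:
#   [default_licenses, highest_version, other_licenses]
def licenses_for(compact, version)
  default_licenses, _highest_version, other_licenses = compact

  # other_licenses is a list of [licenses, versions] pairs.
  match = other_licenses.find { |_licenses, versions| versions.include?(version) }

  # Assumption: versions not listed explicitly use the default licenses.
  match ? match.first : default_licenses
end

line = '{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }'
package = JSON.parse(line)

puts licenses_for(package["licenses"], "1.1.3").inspect
puts licenses_for(package["licenses"], "3.2.0").inspect
```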
Data Storage in Database
The data object will be stored in the licenses column of the pm_packages table added in
To compress this dataset further, the license names (spdx_identifiers) will be converted to their corresponding integer ids from the pm_licenses table:
# with Ruby=1 and MIT=2
package.licenses = { package_name: "image_size", licenses: [[1], "3.2.0", [[[2],["1.1.2", "1.1.3", "1.1.4"]]]] }
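The conversion could look something like the sketch below, which maps each license name in the compact array to its id. The method name `encode_licenses` and the in-memory `license_ids` hash are hypothetical stand-ins for a lookup against pm_licenses.

```ruby
# Hypothetical sketch: replace spdx identifiers with ids in the compact array.
# license_ids stands in for a name => id dictionary built from pm_licenses.
def encode_licenses(compact, license_ids)
  default_licenses, highest_version, other_licenses = compact

  # fetch raises KeyError for an unknown license instead of storing nil.
  to_ids = ->(names) { names.map { |name| license_ids.fetch(name) } }

  [
    to_ids.call(default_licenses),
    highest_version,
    other_licenses.map { |names, versions| [to_ids.call(names), versions] }
  ]
end

license_ids = { "Ruby" => 1, "MIT" => 2 }
compact = [["Ruby"], "3.2.0", [[["MIT"], ["1.1.2", "1.1.3", "1.1.4"]]]]

puts encode_licenses(compact, license_ids).inspect
```

Using `fetch` rather than `[]` is a deliberate choice in this sketch: a license missing from the dictionary fails loudly rather than being silently stored as a null id.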
Implementation plan
- PackageMetadata::SyncConfiguration.all_by_enabled_purl_types updated to use version_format v2 when the package_metadata_license_version_format_v2 feature flag is set
- PackageMetadata::SyncService updated to use version_format
  - pass version_format to ingestion
  - pass version_format to checkpoint
- PackageMetadata::Connector::Gcp updated to use version_format
  - form correct url based on version_format
    - when version_format=v2 the url changes to /v2/purl_type/licenses/sequence/chunk.ndjson
  - extract CsvFile into own DataFile which can parse both json and csv
  - update connector to instantiate correct data file based on version_format
- PackageMetadata::DataObject copied to create new class DataObjectV2 with the following attributes: package, licenses_names, license_set
- PackageMetadata::Ingestion::IngestionService split to add v2 service and tasks
  - add IngestionServiceV2 with 2 tasks
    - Tasks::IngestLicenses (already exists)
    - Tasks::IngestPackagesV2
      - use bulk upsert with dictionary of licenses derived from first task
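The connector's url selection from the plan above might be sketched like this. The method names `version_format_for` and `chunk_path` are hypothetical, the feature flag is reduced to a plain boolean, and the v2 path is taken as written in the plan.

```ruby
# Hypothetical sketch: choose the version_format from the feature flag state.
def version_format_for(v2_flag_enabled)
  v2_flag_enabled ? 'v2' : 'v1'
end

# Hypothetical sketch: build the chunk path for a given version_format.
def chunk_path(version_format:, purl_type:, sequence:, chunk:)
  case version_format
  when 'v1'
    "/v1/#{purl_type}/#{sequence}/#{chunk}.csv"
  when 'v2'
    # Path layout as written in the implementation plan.
    "/v2/#{purl_type}/licenses/#{sequence}/#{chunk}.ndjson"
  else
    raise ArgumentError, "unknown version_format: #{version_format}"
  end
end

# Illustrative values only.
format = version_format_for(true)
puts chunk_path(version_format: format, purl_type: 'gem', sequence: '1', chunk: '0')
```

Raising on an unknown version_format keeps a misconfigured instance from silently falling back to the wrong dataset.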
Verification steps
Instructions on how to run ingestion are documented here: #409732 (comment 1386970564)
After ingestion is complete, all exported purl_types should be populated, e.g. PackageMetadata::Package.where(purl_type: x).count > 0.
Checkpoints for all exported purl_types in version_format v2 should be set, e.g. PackageMetadata::Checkpoint.where(purl_type: x).where(version_format: 'v2').count > 0.