Ingest version_format v2 for licenses
Why are we doing this work
Imported package metadata contains a large amount of duplication. To remove this duplication, a new version_format
is added which compresses license-version data for a package by removing duplicates.
Relevant links
- Research spike: #407454 (closed)
- Discussion of relevant data structure and algorithm: #407454 (comment 1357478354)
Non-functional requirements
- Documentation: n/a
- Feature flag: an FF is needed to select which version of ingestion is used on an instance (see implementation plan)
- Performance: assess that the final size of the dataset is "significantly smaller" (see research spike discussion about this)
- Testing: n/a
Version Format v2
license-exporter will output a new version_format (v2).
URL
- Old format: https://bucket/v1/purl_type/sequence/chunk.csv
- New format: https://bucket/v2/licenses/purl_type/sequence/chunk.csv
Data Encoding
The data encoding for v2 is ndjson.
Data Structure
The data structure has been changed from a csv with a package-version-license line to an object storing all license data for a given package.
To keep the stored data structure as compact as possible, the json object contains the package name and the full set of licenses with their corresponding versions.
An example json object:
{
  "package_name": "image_size",
  "licenses": {
    "default_licenses": ["Ruby"],
    "highest_version": "3.2.0",
    "other_licenses": { ["MIT"]: ["1.1.2", "1.1.3", "1.1.4"] }
  }
}
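As a rough illustration of how per-version rows could collapse into an object like the one above, here is a minimal Ruby sketch. It is not the algorithm from the linked discussion: the method name compress_licenses, the sample versions, and the rule "the most frequent license set becomes the default" are all assumptions made for this example. It also assumes every version string is parseable by Gem::Version.

```ruby
require "rubygems" # Gem::Version is used for version ordering

# Hypothetical helper: version_licenses is a list of [version, licenses] pairs,
# i.e. the information carried by the v1 package-version-license rows.
def compress_licenses(package_name, version_licenses)
  # Group versions by their (sorted) license set.
  versions_by_set = Hash.new { |hash, key| hash[key] = [] }
  version_licenses.each do |version, licenses|
    versions_by_set[licenses.sort] << version
  end

  # Assumption: the license set shared by the most versions is the default.
  default_set, = versions_by_set.max_by { |_set, versions| versions.size }

  # Highest version across all rows, compared as versions rather than strings.
  highest = version_licenses.map(&:first).max_by { |v| Gem::Version.new(v) }

  {
    "package_name" => package_name,
    "licenses" => {
      "default_licenses" => default_set,
      "highest_version" => highest,
      # Only versions that differ from the default are listed explicitly.
      "other_licenses" => versions_by_set.reject { |set, _| set == default_set }
    }
  }
end

# Illustrative input only; these versions are made up for the example.
rows = [
  ["1.1.2", ["MIT"]], ["1.1.3", ["MIT"]], ["1.1.4", ["MIT"]],
  ["1.1.5", ["Ruby"]], ["2.0.0", ["Ruby"]], ["3.0.0", ["Ruby"]], ["3.2.0", ["Ruby"]]
]
compressed = compress_licenses("image_size", rows)
```

With this input, seven rows reduce to one default license set, one highest version, and a single explicit entry for the three MIT versions.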
To decrease data size, the licenses object is actually serialized as a 3-element array in which each key corresponds to a position in the array:
{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }
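To make the positional encoding concrete, the following Ruby sketch expands one ndjson line back into the keyed form. The method name expand_line is hypothetical; the positions (default licenses, highest version, other licenses) follow the example line above.

```ruby
require "json"

# Hypothetical helper: expand one ndjson line into the keyed representation.
def expand_line(line)
  record = JSON.parse(line)
  # Position 0: default licenses, 1: highest version, 2: other licenses.
  default_licenses, highest_version, other_licenses = record.fetch("licenses")

  {
    "package_name" => record.fetch("package_name"),
    "licenses" => {
      "default_licenses" => default_licenses,
      "highest_version" => highest_version,
      # Each entry is a [licenses, versions] pair, so it converts to a hash
      # keyed by the license list.
      "other_licenses" => other_licenses.to_h
    }
  }
end

line = '{ "package_name": "image_size", "licenses": [["Ruby"], "3.2.0", [[["MIT"],["1.1.2", "1.1.3", "1.1.4"]]]] }'
expanded = expand_line(line)
```

Ruby hashes accept arrays as keys, so the expanded other_licenses value mirrors the keyed example even though that shape cannot be written as plain JSON.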
Data Storage in Database
The data object will be stored in the licenses column of the pm_packages table added in
To compress this dataset, the license names (spdx_identifiers) will be converted to their int equivalents from the pm_licenses table:
# with Ruby=1 and MIT=2
package.licenses = { package_name: "image_size", licenses: [[1], "3.2.0", [[[2],["1.1.2", "1.1.3", "1.1.4"]]]] }
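The substitution can be sketched in plain Ruby as follows. The lookup hash stands in for the pm_licenses table, and the names LICENSE_IDS and licenses_to_ids are hypothetical, introduced only for this example.

```ruby
# Hypothetical lookup built from pm_licenses: spdx_identifier => id.
LICENSE_IDS = { "Ruby" => 1, "MIT" => 2 }.freeze

# Replace every license name in the 3-element licenses array with its id.
# fetch raises KeyError for a license that is missing from the lookup.
def licenses_to_ids(licenses, license_ids)
  default_licenses, highest_version, other_licenses = licenses
  [
    default_licenses.map { |name| license_ids.fetch(name) },
    highest_version,
    other_licenses.map do |names, versions|
      [names.map { |name| license_ids.fetch(name) }, versions]
    end
  ]
end

compact = licenses_to_ids(
  [["Ruby"], "3.2.0", [[["MIT"], ["1.1.2", "1.1.3", "1.1.4"]]]],
  LICENSE_IDS
)
# compact == [[1], "3.2.0", [[[2], ["1.1.2", "1.1.3", "1.1.4"]]]]
```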
Implementation plan
- PackageMetadata::SyncConfiguration.all_by_enabled_purl_types updated to use version_format v2 when the package_metadata_license_version_format_v2 feature flag is set
- PackageMetadata::SyncService updated to use version_format
  - pass version_format to ingestion
  - pass version_format to checkpoint
- PackageMetadata::Connector::Gcp updated to use version_format
  - form correct url based on version_format
    - when version_format=v2 the url changes to /v2/purl_type/licenses/sequence/chunk.ndjson
  - extract CsvFile into own DataFile which can parse both json and csv
  - update connector to instantiate correct data file based on version_format
- PackageMetadata::DataObject copied to create new class DataObjectV2 with the following attributes
  - package
  - licenses_names
  - license_set
- PackageMetadata::Ingestion::IngestionService split to add v2 service and tasks
  - add IngestionServiceV2 with 2 tasks
    - Tasks::IngestLicenses (already exists)
    - Tasks::IngestPackagesV2
      - use bulk upsert with dictionary of licenses derived from first task
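The "parse both json and csv" step in the plan above could look roughly like the sketch below. It is a stand-in for illustration, not the actual DataFile class: the method name parse_line and the "v1"/"v2" string values are assumptions, and the v1 row layout (package, version, license) is inferred from the package-version-license description.

```ruby
require "csv"
require "json"

# Hypothetical stand-in for the DataFile parsing step: pick the parser
# that matches the version_format of the file being read.
def parse_line(line, version_format)
  case version_format
  when "v1"
    # v1: one package-version-license row per csv line.
    CSV.parse_line(line)
  when "v2"
    # v2: one JSON object per ndjson line.
    JSON.parse(line)
  else
    raise ArgumentError, "unknown version_format: #{version_format}"
  end
end

v1_row = parse_line("image_size,1.1.2,MIT", "v1")
v2_row = parse_line('{"package_name":"image_size","licenses":[["Ruby"],"3.2.0",[]]}', "v2")
```

Keeping the format switch in one place means the connector only has to pass version_format through, which matches the plan's intent of instantiating the correct data file per format.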
Verification steps
Instructions on how to run ingestion are documented here: #409732 (comment 1386970564)
After ingestion is complete, all exported purl_types should be populated, e.g. PackageMetadata::Package.where(purl_type: x).count > 0.
Checkpoints for all exported purl_types in version_format v2 should be set, e.g. PackageMetadata::Checkpoint.where(purl_type: x).where(version_format: 'v2').count > 0.