
Export licenses to version format v2

Why are we doing this work

Imported package metadata contains a large amount of duplication, which is causing database size issues for its consumers. license-exporter should be changed to remove this duplication from the dataset in order to store the most compact version_format possible.

Relevant links

Non-functional requirements

  • Documentation: n/a
  • Feature flag: n/a
  • Performance: n/a
  • Testing: n/a

Implementation outline

Deduplication is accomplished by grouping a package's data into license-to-version sets. Because most packages have a single license, they do not need to store any information other than the license name, which applies to all of their versions (e.g. { "rails": "MIT" }).

For packages with multiple license-to-version sets, the data structure has to evolve. The default license set will still be stored. Every other license-to-version combination needs to store the license set together with the full list of versions it corresponds to. The data structure in the above example thus evolves to: { default: "MIT", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] } }.

Additionally, the maximum version seen so far needs to be stored so as not to misrepresent versions that have not yet been ingested. For example: if rails licenses have been ingested up to 7.0.0, the database has { rails: { default: MIT } }, and when a caller queries 7.0.1 they will incorrectly infer that the license for this version is MIT. This can be done via a highest_version attribute: { default: "MIT", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] }, highest_version: "7.0.0" }
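For illustration, a minimal Go sketch of the per-package value described above (the field names, types, and the exact JSON envelope around the package name are assumptions, not the exporter's real definitions):

```go
// LicenseSetV2 is an illustrative shape for the value exported per package
// in version_format v2. A single-license package serializes with only
// "default"; "other" and "highest_version" appear when more than one
// license-to-version set exists.
type LicenseSetV2 struct {
	// Default license set, applied to every version not listed in Other.
	Default string `json:"default"`
	// Other license sets mapped to the exact versions they apply to,
	// e.g. {"Apache": ["7.0.1", "7.0.2", "7.0.3"]}.
	Other map[string][]string `json:"other,omitempty"`
	// Highest version ingested so far, so callers do not infer the default
	// license for versions that have not been ingested yet.
	HighestVersion string `json:"highest_version,omitempty"`
}
```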

Sets of licenses shouldn't have duplicates.

Exported data should match the constraints we have in the JSON schema for pm_packages.licenses.

Pseudocode of changes

  • the URL written is updated to support version_format v2
    • Old format: https://bucket/v1/purl_type/sequence/chunk.csv
    • New format: https://bucket/v2/purl_type/sequence/chunk.ndjson
  • the bucket data is encoded as ndjson
  • the data structure output is updated to the above
  • the export algorithm is changed to
    1. fetch packages which have been updated since a given timestamp together with all of their licenses
    2. group license data by license set together with the corresponding versions (a sketch of this grouping follows the list)
    3. output the resulting JSON objects to the ndjson file
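A rough sketch of the grouping step, reusing the LicenseSetV2 shape sketched above and assuming the default is simply the license set that covers the most versions (the actual selection rule may differ):

```go
import "sort"

// groupLicenses builds the v2 value (see LicenseSetV2 above) from a
// version -> license-set mapping. The license set covering the most
// versions becomes the default; every other set keeps an explicit
// version list.
func groupLicenses(versionLicenses map[string]string, highestVersion string) LicenseSetV2 {
	// Invert the mapping: license set -> versions it applies to.
	byLicense := map[string][]string{}
	for version, license := range versionLicenses {
		byLicense[license] = append(byLicense[license], version)
	}

	// Pick the license set with the most versions as the default.
	var defaultLicense string
	for license, versions := range byLicense {
		if len(versions) > len(byLicense[defaultLicense]) {
			defaultLicense = license
		}
	}
	delete(byLicense, defaultLicense)

	out := LicenseSetV2{Default: defaultLicense, HighestVersion: highestVersion}
	if len(byLicense) > 0 {
		for _, versions := range byLicense {
			sort.Strings(versions) // keep output deterministic
		}
		out.Other = byLicense
	}
	return out
}
```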

Filtering unknown licenses

As discussed in the research spike, unknown licenses make up a large part of the dataset and do not need to be stored; the deduplication step should filter these licenses out.

This optimization doesn't apply to packages that have multiple sets of licenses. For instance, the exporter must still export { default: "unknown", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] }, highest_version: "7.0.0" } if we only know the license of versions 7.0.1 through 7.0.3 (Apache).
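A short sketch of that rule, assuming unknown entries are represented by the literal license name "unknown" (the real marker may differ):

```go
// shouldExport reports whether a grouped package belongs in the v2 output.
// Packages whose only license set is "unknown" are dropped, but a package
// that mixes "unknown" with known sets is still exported so the known
// versions are preserved.
func shouldExport(pkg LicenseSetV2) bool {
	if len(pkg.Other) > 0 {
		// Multiple license sets: keep the package even if the default
		// set is "unknown".
		return true
	}
	return pkg.Default != "unknown"
}
```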

Implementation plan

  • Update license-exporter to export in the new v2 format.
    • Add SQL queries to fetch new and updated packages using a CURSOR.
    • Add these queries to the Database struct type.
    • Add a CLI flag to switch between v1 and v2. For backward compatibility, v1 is the default.
    • Add an NDJSON writer/encoder, similar to the existing CSV writer (a minimal sketch follows this list).
    • Refactor ObjectRotator to support both formats, and both writers/encoders.
    • Refactor lock file handling to support both formats.
    • Update the existing unit tests, and add new ones wherever this is needed.
  • Run license-exporter from deployment project.
    • Update CI configuration file.
    • Rename existing scheduled pipelines.
      • Rename dev export to dev export v1.
      • Rename prod export to prod export v1.
    • Create new scheduled pipelines.
      • Create new dev export v2 pipeline.
      • Create new prod export v2 pipeline.
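As referenced in the NDJSON writer item above, a minimal sketch of such a writer; Go's json.Encoder terminates every encoded value with a newline, which is exactly the NDJSON framing (the constructor and method names here are assumptions, not the exporter's existing interface):

```go
import (
	"encoding/json"
	"io"
)

// NDJSONWriter writes one JSON document per line, playing the same role as
// the existing CSV writer.
type NDJSONWriter struct {
	enc *json.Encoder
}

func NewNDJSONWriter(w io.Writer) *NDJSONWriter {
	return &NDJSONWriter{enc: json.NewEncoder(w)}
}

// Write encodes a single exported record as one NDJSON line; json.Encoder
// appends the trailing newline itself.
func (w *NDJSONWriter) Write(record any) error {
	return w.enc.Encode(record)
}
```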

Verification steps

Run the export and check the v2 directory of the export bucket.

  • Export all.
  • Export since a given date (explicitly passed).
  • Export since the last update (extracting the timestamp from the last export).

Check the lock mechanism.

  • Lock file is created when there's none.
  • Export is skipped when the lock file exists and it's not outdated.
  • Lock file is removed and the export runs when the lock file exists and it's outdated.
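For reference, the decision logic these checks exercise could look roughly like this; treating the lock as outdated once its mtime is older than a TTL is an assumption about how staleness is defined:

```go
import (
	"errors"
	"io/fs"
	"os"
	"time"
)

// shouldRun reports whether an export should proceed, handling the three
// cases above: no lock -> create it and run; fresh lock -> skip; outdated
// lock -> replace it and run.
func shouldRun(lockPath string, ttl time.Duration) (bool, error) {
	info, err := os.Stat(lockPath)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		// No lock file yet: create one and run the export.
		return true, os.WriteFile(lockPath, []byte(time.Now().Format(time.RFC3339)), 0o644)
	case err != nil:
		return false, err
	case time.Since(info.ModTime()) < ttl:
		// Lock exists and is not outdated: skip this export run.
		return false, nil
	default:
		// Lock exists but is outdated: remove it and run the export.
		if err := os.Remove(lockPath); err != nil {
			return false, err
		}
		return true, os.WriteFile(lockPath, []byte(time.Now().Format(time.RFC3339)), 0o644)
	}
}
```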

To be tested on the dev environment before deploying to prod.

Test updates

Update the License Sanity test to support v2 URLs with the expected minimum sizes:

https://gitlab.com/gitlab-org/security-products/tests/license-db-sanity/-/blob/main/qa/spec/sanity_test_exec_spec.rb#L21
