Export licenses for ranges of versions
Problem to solve
The Package Metadata DB (AKA License DB) exports license data using the export format v2 designed in Export licenses to version format v2 (#409732 - closed) and implemented in Reduce package metadata table on-disk footprint (&10415 - closed). This has helped reduce the size of the exports.
However, the current export format isn't a good fit when a package has a large number of versions and half of these versions don't have the default licenses. We end up with large JSON objects that list hundreds of versions.
- We had to set limits for the number of versions listed in a JSON object (#442419 (closed)). As a consequence, License Scanning might return unknowns or incorrect results. It might also attempt to parse and compare invalid versions that should be listed under other_licenses. See #442419 (comment 1914757289)
- v2 exports are almost as big as v1 exports, even though the v2 format was introduced to reduce the size of exports. See #462874 (comment 1943700164)
Proposal
Export ranges of versions with the corresponding licenses, or export sets of licenses with the corresponding ranges.
The license exporter sorts all versions of a package and groups consecutive versions that share the same set of licenses into ranges. It exports each version range as a pair of boundaries (lowest version and highest version).
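The grouping step could look roughly like the sketch below. It is only an illustration: the function names, the dictionary keys (lowest, highest, licenses) and the naive dotted-integer comparator are all assumptions, not the v3 format or the license-exporter code.

```python
from itertools import groupby


def version_key(version):
    """Hypothetical comparator: handles plain dotted integers only (e.g. "1.10.0")."""
    return tuple(int(part) for part in version.split("."))


def compress_to_ranges(licenses_by_version):
    """Collapse consecutive versions that share a license set into ranges."""
    ordered = sorted(licenses_by_version, key=version_key)

    def license_set(version):
        # Sort so that ["MIT", "BSD"] and ["BSD", "MIT"] count as the same set.
        return tuple(sorted(licenses_by_version[version]))

    ranges = []
    # groupby only merges adjacent items, so a license set that reappears
    # later in the sorted order starts a new range.
    for licenses, group in groupby(ordered, key=license_set):
        versions = list(group)
        ranges.append(
            {
                "lowest": versions[0],
                "highest": versions[-1],
                "licenses": list(licenses),
            }
        )
    return ranges
```

With six versions where 1.0.0 to 1.2.0 use MIT, 1.10.0 and 2.0.0 use Apache-2.0, and 2.1.0 uses MIT again, this sketch yields three ranges instead of six per-version entries.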
On the backend, License Scanning iterates over the ranges until it finds one that contains the requested version.
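A minimal sketch of that lookup follows, written in Python for illustration even though the GitLab backend is not. The range layout (lowest, highest, licenses) and the dotted-integer comparator are assumptions made for this example; a real implementation would rely on a proper version comparison library.

```python
def version_key(version):
    """Hypothetical comparator: handles plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def find_licenses(ranges, requested_version):
    """Return the licenses of the first range containing the version, else None."""
    requested = version_key(requested_version)
    for version_range in ranges:
        lowest = version_key(version_range["lowest"])
        highest = version_key(version_range["highest"])
        if lowest <= requested <= highest:
            return version_range["licenses"]
    # No range matched: the caller decides how to report an unknown license.
    return None
```

One consequence of matching on boundaries only: in this sketch, a version that was never published but sorts between lowest and highest still matches the range.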
The new export format is published as v3. For backward compatibility, the exporter still generates v2 exports.
Further details
This proposal was discussed in Spike: How to reduce package metadata tables fo... (#407454 - closed).
Before that, it was discussed in Spike: Efficient storage of redundant licenses ... (#374901 - closed). See #374901 (comment 1111873025)
Challenges
The main challenge is to ensure that PMDB and the GitLab backend are consistent in the way they compare versions. Ideally we would use semver_dialects in PMDB, possibly wrapped up in a service. See #462874 (comment 1915806452)
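To make the consistency risk concrete, here is a small, purely illustrative sketch. It does not reproduce semver_dialects or the behavior described in the linked comments. It only shows that a range built with one ordering can reject a version when it is checked with a different ordering. Both comparators are assumptions chosen for the example.

```python
def numeric_key(version):
    """Hypothetical exporter-side ordering: compare dotted integers."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lowest, highest, key):
    """Check range membership using the given ordering."""
    return key(lowest) <= key(version) <= key(highest)


# Range boundaries as they would be produced with the numeric ordering.
lowest, highest = "1.2.0", "1.10.0"

# Same ordering on both sides: 1.9.0 is inside the range.
exporter_view = in_range("1.9.0", lowest, highest, key=numeric_key)

# A consumer that compares raw strings disagrees, because "1.9.0" > "1.10.0"
# when compared character by character.
backend_view = in_range("1.9.0", lowest, highest, key=str)
```

Here exporter_view is True and backend_view is False, which is the kind of mismatch that sharing a single comparison implementation is meant to rule out.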
Alternatively, we could have a service that compresses raw NDJSON files (i.e. files where all versions are listed), using logic similar to the compression implemented in license-exporter. That service would be implemented in Ruby and would use semver_dialects.
See https://gitlab.com/gitlab-com/sec-sub-department/section-sec-request-for-help/-/issues/459#note_2230462006 for the problem that occurs when semver_dialects is not used in the license-exporter.
/cc @nilieskou @ifrenkel