Spike: Measure impact of ignoring highest version when querying compressed package metadata
Topic to Evaluate
Assess the impact of Add highest version support when querying compr... (#410434 - closed).
Tasks to Evaluate
- Describe a strategy to measure the impact (error ratio)
- Implement this strategy in a script
- Share error ratio for each package type
Proposal
Since export v1 gives the licenses of every single version, we can rely on it to identify classification errors when export v2 is not interpreted correctly.
This can be implemented in a Ruby script that calls `semver_dialects` and emulates what the backend would do.
The script would take the following arguments:
- root directory of export; this contains both `v1` and `v2`
- registry name/package type (subdirectory)
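The two arguments above could be resolved into input directories roughly as follows. This is only a sketch: `parse_args`, `Options` and the `measure.rb` script name are hypothetical, and the registry list is the one given later in this issue.

```ruby
# Registry names listed in this issue.
REGISTRIES = %w[conan go maven npm nuget packagist pypi rubygem].freeze

# Hypothetical container for the resolved input directories.
Options = Struct.new(:v1_dir, :v2_dir, keyword_init: true)

# Turns [export_root, registry] into the two directories the script reads.
def parse_args(argv)
  raise ArgumentError, "usage: measure.rb EXPORT_ROOT REGISTRY" unless argv.size == 2

  root, registry = argv
  raise ArgumentError, "unknown registry: #{registry}" unless REGISTRIES.include?(registry)

  Options.new(v1_dir: File.join(root, "v1", registry),
              v2_dir: File.join(root, "v2", registry))
end
```

In the script itself this would be called as `opts = parse_args(ARGV)`.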
Implementation
- Load all NDJSON files: `v2/#{registry}/*/*.ndjson`.
- Iterate CSV files: `v1/#{registry}/*/*.csv`.
  - Iterate lines, and accumulate lines that correspond to the same package version into a license set.
- For each package version and its license set loaded from export v1:
  - Query compressed data loaded from NDJSON.
  - Compare query result to license set loaded from v1.
  - Track success or error.
- Show error ratio.
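The accumulation step for export v1 might look like the sketch below. It assumes each CSV row has three columns (package name, version, license); that layout and the helper names `each_v1_row` and `accumulate_license_sets` are assumptions for illustration, not taken from the exporter.

```ruby
require "csv"
require "set"

# Yields [name, version, license] for every row of every CSV file under
# <v1_registry_dir>/<timestamp>/. The three-column layout is an assumption.
def each_v1_row(v1_registry_dir)
  return enum_for(:each_v1_row, v1_registry_dir) unless block_given?

  Dir.glob(File.join(v1_registry_dir, "*", "*.csv")).sort.each do |path|
    CSV.foreach(path) { |row| yield row.first(3) }
  end
end

# Groups rows by [name, version] and collects their licenses into a Set.
def accumulate_license_sets(rows)
  sets = Hash.new { |hash, key| hash[key] = Set.new }
  rows.each do |name, version, license|
    sets[[name, version]] << license
  end
  sets
end
```

Called as `accumulate_license_sets(each_v1_row(opts.v1_dir))`, this keeps every package version of the registry in one Hash, which is where the memory cost mentioned below comes from.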
Warning! This might consume a significant amount of memory.
Export v1 and v2 are downloaded locally using `gsutil` prior to running the script.
Alternatively, prepare a GitLab instance that has both export v1 and export v2 in its PostgreSQL database, and write a Ruby script that runs in the Rails console. The drawback is that sync with v1 has been removed from the codebase.
Implementation Plan
Preparation
- Download v1 and v2 license data: You will need `gsutil`. You can install it by following the GCP instructions.

  ```bash
  # First go to your project root directory
  gsutil cp -r gs://prod-export-license-bucket-1a6c642fc4de57d4/v1 ./
  gsutil cp -r gs://prod-export-license-bucket-1a6c642fc4de57d4/v2 ./
  ```

At the end of this step you will have two folders with all the licenses.
Implementing the script
- Create the data structures required to store a license set, a license group and compressed license groups. Links provided are from Go code, so you will need to translate them into Ruby.
- Create a simple command line tool that receives a registry name. Possible values are: `conan`, `go`, `maven`, `npm`, `nuget`, `packagist`, `pypi` and `rubygem`. Make sure that your `v1/` and `v2/` dirs are in the root of your project.
- We need to load the `v2` data in memory. These files use the compressed license group data structure. Each line of the `v2/<registry>/<timestamp>/*.ndjson` files is a JSON object representing a compressed license group, so parsing should be simple. At the end of this step we have a table of compressed license groups.
- For each line of `v1/<registry>/<timestamp>/*.csv`:
  - Parse the package name, version and license.
  - Using the package name, find the corresponding compressed license group in the v2 table.
  - Using the version field from `v1`, find the respective version in the compressed license group element. For the version comparison you can use the `semver_dialects` gem.
  - Track hits or misses. In case of a miss, keep the name of the package, the version, the license in v1 and the license in v2. This will require a new data structure.
- The script outputs the error ratio. We could extend this with further statistics.
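The steps above could be sketched as follows. Several things here are assumptions rather than facts from the exporter: the v2 field names (`name`, `lowest_version`, `highest_version`, `default_licenses`, `other_licenses` with `licenses` and `versions`), the struct names, and above all `query_licenses`, which is a deliberately simplified stand-in for the backend query. It matches versions by exact string and ignores `lowest_version` and `highest_version`; the real script would replace it with the backend logic and a `semver_dialects` comparison.

```ruby
require "json"
require "set"

# Assumed shape of one v2 NDJSON line; adjust to the actual export schema.
CompressedLicenseGroup = Struct.new(:name, :lowest_version, :highest_version,
                                    :default_licenses, :other_licenses,
                                    keyword_init: true)

# Record kept for every miss (the "new data structure" from the last step).
Miss = Struct.new(:name, :version, :v1_licenses, :v2_licenses, keyword_init: true)

# Parses NDJSON lines into a Hash keyed by package name.
def load_v2(lines)
  lines.each_with_object({}) do |line, table|
    next if line.strip.empty?

    obj = JSON.parse(line)
    table[obj["name"]] = CompressedLicenseGroup.new(
      name: obj["name"],
      lowest_version: obj["lowest_version"],
      highest_version: obj["highest_version"],
      default_licenses: Array(obj["default_licenses"]),
      other_licenses: Array(obj["other_licenses"])
    )
  end
end

# Simplified stand-in for the backend query (an assumption, see above):
# licenses of the matching "other" entry, otherwise the default licenses.
def query_licenses(group, version)
  other = group.other_licenses.find { |entry| entry["versions"].include?(version) }
  Set.new(other ? other["licenses"] : group.default_licenses)
end

# Compares every v1 license set with the v2 query result.
# Returns [error_ratio, misses]. A package absent from v2 counts as a miss.
def measure(v1_sets, v2_table, query: method(:query_licenses))
  misses = []
  v1_sets.each do |(name, version), expected|
    group = v2_table[name]
    actual = group ? query.call(group, version) : Set.new
    next if actual == expected

    misses << Miss.new(name: name, version: version,
                       v1_licenses: expected, v2_licenses: actual)
  end
  ratio = v1_sets.empty? ? 0.0 : misses.size.fdiv(v1_sets.size)
  [ratio, misses]
end
```

Passing the query as a parameter makes it possible to run the same v1 data against two interpretations of v2 (for example with and without highest version support) and compare the two error ratios.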