Spike: Measure impact of ignoring highest version when querying compressed package metadata

Topic to Evaluate

Assess the impact of Add highest version support when querying compr... (#410434 - closed).

Tasks to Evaluate

Describe a strategy to measure the impact (error ratio)
Implement this strategy in a script
Share error ratio for each package type

Proposal

Since export v1 gives the licenses of every single version, we can rely on it to identify classification errors when export v2 is not interpreted correctly.

This can be implemented in a Ruby script that calls semver_dialects and emulates what the backend would do.

The script would take the following arguments:

root directory of export; this contains both v1 and v2
registry name/package type (subdirectory)
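
A minimal sketch of how the script might read these two arguments, assuming they are passed positionally. The `parse_args` helper, the `REGISTRIES` constant and the returned hash are hypothetical names used for illustration; the registry names are the ones listed under "Implementing the script".

```ruby
# Registry names as listed under "Implementing the script".
REGISTRIES = %w[conan go maven npm nuget packagist pypi rubygem].freeze

# Hypothetical helper: validates [export_root, registry] and returns the
# v1 and v2 directories for that registry. Raises ArgumentError on bad input.
def parse_args(argv)
  raise ArgumentError, "usage: script EXPORT_ROOT REGISTRY" unless argv.size == 2

  root, registry = argv
  unless REGISTRIES.include?(registry)
    raise ArgumentError, "unknown registry #{registry.inspect}; expected one of #{REGISTRIES.join(', ')}"
  end

  {
    registry: registry,
    v1_dir: File.join(root, "v1", registry),
    v2_dir: File.join(root, "v2", registry)
  }
end
```

Calling `parse_args(ARGV)` at the top of the script would then give the two directories to scan.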

Implementation

Load all NDJSON files: v2/#{registry}/*/*.ndjson.
Iterate CSV files: v1/#{registry}/*/*.csv.
- Iterate lines, and accumulate lines that correspond to the same package version into a license set.
- For each package version and its license set loaded from export v1,
  - Query compressed data loaded from NDJSON.
  - Compare query result to license set loaded from v1.
  - Track success or error.
Show error ratio.
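
The accumulation and comparison steps above could be sketched as follows. This assumes each v1 CSV line has the shape `name,version,license` with no header row, which is not confirmed here. `license_sets_from_v1` and `error_ratio` are hypothetical helper names, and `query` is a callable standing in for the lookup against the compressed v2 data.

```ruby
require "csv"
require "set"

# Groups v1 CSV lines into { [name, version] => Set of licenses }.
# Assumption: each line is `name,version,license` with no header row.
def license_sets_from_v1(lines)
  sets = Hash.new { |hash, key| hash[key] = Set.new }
  lines.each do |line|
    name, version, license = CSV.parse_line(line)
    next if name.nil? || version.nil? || license.nil?

    sets[[name, version]] << license
  end
  sets
end

# Compares each v1 license set with the result of `query` and returns the
# total number of package versions, the number of errors and the error ratio.
def error_ratio(v1_sets, query)
  errors = 0
  v1_sets.each do |(name, version), expected|
    errors += 1 unless query.call(name, version) == expected
  end
  total = v1_sets.size
  { total: total, errors: errors, ratio: total.zero? ? 0.0 : errors.fdiv(total) }
end
```

Passing `File.foreach(path)` as `lines` would stream a CSV file instead of reading it whole, although the accumulated sets still stay in memory.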

Warning! This might consume a significant amount of memory.

Export v1 and v2 are downloaded locally using gsutil prior to running the script.

Alternatively, prepare a GitLab instance that has both export v1 and export v2 in its PostgreSQL database, and write a Ruby script that runs in the Rails console. Con: Sync with v1 has been removed from the codebase.

Implementation Plan

Preparation

Download the v1 and v2 license data. You will need gsutil, which you can install by following the GCP instructions.

```bash
# First go to your project root directory
gsutil cp -r gs://prod-export-license-bucket-1a6c642fc4de57d4/v1 ./
gsutil cp -r gs://prod-export-license-bucket-1a6c642fc4de57d4/v2 ./
```

At the end of this step you will have two folders, v1/ and v2/, containing all the license data.

Implementing the script

Create the data structures required to store a license set, a license group and compressed license groups. The links provided point to Go code, so you will need to translate it into Ruby.
Create a simple command line tool that receives a registry name. Possible values are: conan, go, maven, npm, nuget, packagist, pypi and rubygem. Make sure that your v1/ and v2/ directories are in the root of your project.
Load the v2 data into memory. These files use the compressed license group data structure. Each line of v2/<registry>/<timestamp>/*.ndjson is a JSON object representing a compressed license group, so parsing should be simple. At the end of this step we have a table of compressed license groups.
For each line of v1/<registry>/<timestamp>/*.csv:
- Parse the package name, version and license.
- Using the package name, find the corresponding compressed license group in the v2 table.
- Using the version field from v1, find the matching version in the compressed license group. For the version comparison you can use the semver_dialects gem.
- Track hits and misses. In case of a miss, keep the package name, the version, the license in v1 and the license in v2. This will require a new data structure.
The script outputs the error ratio. We could extend this with further statistics.
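
A rough sketch of the v2 lookup and the miss record, under several assumptions that are not confirmed in this issue. Each NDJSON object is assumed to carry the fields `name`, `highest_version`, `default_licenses` and `other_licenses` (a list of `{licenses, versions}` entries). A version above `highest_version` is assumed to yield an empty license set when the highest version is honored. `Gem::Version` is used only as a stand-in comparator; the real script would delegate to semver_dialects. All helper names are hypothetical.

```ruby
require "json"
require "set"

# Stand-in comparator (an assumption): the real script would use
# semver_dialects so that each registry's version syntax is honored.
GEM_NEWER = ->(a, b) { Gem::Version.new(a) > Gem::Version.new(b) }

# One recorded mismatch, as called for in the "Track hits and misses" step.
Miss = Struct.new(:name, :version, :v1_licenses, :v2_licenses, keyword_init: true)

# Parses NDJSON lines into { package name => compressed license group }.
def index_v2(lines)
  lines.each_with_object({}) do |line, index|
    next if line.strip.empty?

    group = JSON.parse(line)
    index[group["name"]] = group
  end
end

# Returns the license set the compressed group yields for `version`.
# With honor_highest: false, versions above highest_version fall back to
# the default licenses, i.e. the highest version is ignored.
def query_v2(group, version, honor_highest:, newer: GEM_NEWER)
  return Set.new if group.nil?

  other = (group["other_licenses"] || []).find { |entry| entry["versions"].include?(version) }
  return other["licenses"].to_set if other

  highest = group["highest_version"]
  return Set.new if honor_highest && highest && newer.call(version, highest)

  (group["default_licenses"] || []).to_set
end

# Compares v1 license sets against v2 and collects a Miss per mismatch.
def collect_misses(v1_sets, v2_index, honor_highest:)
  v1_sets.each_with_object([]) do |((name, version), expected), misses|
    actual = query_v2(v2_index[name], version, honor_highest: honor_highest)
    next if actual == expected

    misses << Miss.new(name: name, version: version, v1_licenses: expected, v2_licenses: actual)
  end
end
```

Running `collect_misses` once with `honor_highest: true` and once with `honor_highest: false` over the same v1 data would give two miss lists whose sizes, divided by the number of package versions, are the error ratios to compare.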

/cc @philipcunningham @gonzoyumo

Edited Oct 19, 2023 by Nick Ilieskou