Integration tests for semver_dialects using license export v2

Problem to solve

It's not possible to upgrade the backend to semver_dialects v3 because it causes too many errors in License Scanning. See !151761 (comment 1896877139)

We need to investigate to establish why semver_dialects v3 causes License Scanning errors, and how it needs to be adjusted.

Proposal

Add integration tests that validate the semver_dialects gem against the license data published on the Package Metadata DB (AKA License DB).

Use export v2 to ensure that License Scanning (implemented in the backend) can process the entire dataset using the gem.

Check that lowest_version and highest_version can be parsed.
Check that all versions that aren't listed in other_licenses can be parsed w/o errors, and that they're within the boundaries (i.e. greater than or equal to lowest_version and lower than or equal to highest_version).

Using the license exporter

When a CLI flag is set, make the license exporter list the versions that are otherwise omitted. The flag is disabled by default. To be implemented in https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L107

Developers run the license exporter locally (but connected to the production database) to generate the full dataset. They might remove other_licenses and other fields they don't need to run the tests.
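
A minimal sketch of that trimming step, assuming each line of the export is a JSON object with the fields named in this issue. The `versions` key is hypothetical: it stands for whatever field the new CLI flag would add, and `trim_export` is an illustrative helper, not part of the exporter.

```ruby
require "json"

# Fields kept for the tests. "versions" is a hypothetical name for the list
# that the exporter would emit when the new CLI flag is set.
KEPT_FIELDS = %w[name lowest_version highest_version versions].freeze

# Reads NDJSON from `input` and writes trimmed NDJSON to `output`,
# dropping other_licenses and any other field not listed in `kept`.
def trim_export(input, output, kept: KEPT_FIELDS)
  input.each_line do |line|
    next if line.strip.empty? # tolerate blank lines between records

    record = JSON.parse(line)
    output.puts(JSON.generate(record.slice(*kept)))
  end
end
```

With files, this could be called as `File.open("in.ndjson") { |i| File.open("out.ndjson", "w") { |o| trim_export(i, o) } }`, where both file names are placeholders.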

Using exports v1 & v2

Use export v1 to list package versions that License Scanning might have to scan. (We might not list all the versions referenced as project dependencies, but this is the best dataset we have w/o extracting project dependencies from the production database.)

Pseudo-code

For each package type,
- For each JSON object of export v2 of licenses (NDJSON files),
  - Extract corresponding package versions from export v1 (CSV files).
  - Filter out versions that are in other_licenses.
  - For all these versions,
    - Parse version using the gem.
    - Report any parsing error.
    - Keep version if no error.
  - For lowest_version and highest_version,
    - Create an interval.
    - Report parsing error if any.
    - For all versions not in other_licenses and w/o errors,
      - Use the gem to determine if the version is in range.
      - Report error if version not in interval.
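
The pseudo-code above could be sketched in Ruby roughly as follows, for one package type. Several things here are assumptions: export v1 rows are taken to be `name, version, license` with no header, export v2 records are taken to carry `other_licenses` as a list of objects with a `versions` array, and `Gem::Version` is only a stand-in parser so that the sketch runs without the gem. The real tests would pass a parser backed by semver_dialects instead, rescuing whatever error class it raises.

```ruby
require "csv"
require "json"

# Stand-in parser (assumption): Gem::Version is NOT the semver_dialects API.
PARSER = ->(raw) { Gem::Version.new(raw) }

# Returns [parsed_version, nil] on success or [nil, message] on failure.
def try_parse(parser, raw)
  [parser.call(raw), nil]
rescue ArgumentError => e
  [nil, e.message]
end

# Groups export v1 versions by package name (assumed row: name, version, license).
def versions_by_package(csv_text)
  CSV.parse(csv_text).each_with_object(Hash.new { |h, k| h[k] = [] }) do |row, acc|
    acc[row[0]] << row[1]
  end
end

# Checks one export v2 record against its candidate versions; returns error messages.
def check_record(record, candidates, parser: PARSER)
  name = record["name"]
  errors = []
  excluded = record.fetch("other_licenses", []).flat_map { |group| group["versions"] }

  # Parse every version that is not in other_licenses; keep the ones that parse.
  parsed = {}
  (candidates - excluded).uniq.each do |raw|
    version, error = try_parse(parser, raw)
    if error
      errors << "#{name} #{raw}: parse error (#{error})"
    else
      parsed[raw] = version
    end
  end

  # Build the interval from lowest_version and highest_version.
  low, low_error = try_parse(parser, record["lowest_version"])
  high, high_error = try_parse(parser, record["highest_version"])
  errors << "#{name} lowest_version: parse error (#{low_error})" if low_error
  errors << "#{name} highest_version: parse error (#{high_error})" if high_error
  return errors if low_error || high_error

  # Report parsed versions that fall outside the interval.
  parsed.each do |raw, version|
    next if version.between?(low, high)

    errors << "#{name} #{raw}: not in interval [#{record['lowest_version']}, #{record['highest_version']}]"
  end
  errors
end

# Runs the checks for every record of an export v2 NDJSON document.
def check_export(ndjson_text, csv_text, parser: PARSER)
  by_package = versions_by_package(csv_text)
  ndjson_text.each_line.reject { |line| line.strip.empty? }.flat_map do |line|
    record = JSON.parse(line)
    check_record(record, by_package[record["name"]], parser: parser)
  end
end
```

Passing the parser as an argument keeps the traversal independent of the gem version, so the same checks could be run against semver_dialects v1 and v3 and the two error lists compared.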

CI job

Ideally these new integration tests run in a CI job of the semver_dialects project.

It might be expensive to fetch the entire dataset and to run tests against that dataset, so the job might be a manual one.

We might limit the tests to a short list of packages to speed up the tests, and to ensure they pass. For instance, we could limit the tests to the top N most popular packages. In that case the test job would always run. See #462854 (comment 1915259041)

The job might be allowed to fail b/c some failures might be legitimate errors. Invalid versions should always be listed under other_licenses, but that might not always be the case.

Versions might be considered as valid by the export even though they're not. See https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L89
Versions might be removed because the set contains too many versions. See https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L124

Alternatively, developers edit the generated expectations to remove what's incorrect.

Implementation plan

Generate a dataset that combines lowest_version, highest_version, and versions that are valid and have the default set of licenses. (These are normally omitted in v2 exports.)
Add specs to test semver_dialects against that dataset.
- All versions can be parsed.
- All versions are within boundaries.
Investigate failures, and create issues if needed.
Optional: Create curated test data, and add it to the repo.
Optional: Add a CI job that tests the gem against the curated test data.

Edited Jun 05, 2024 by Fabien Catteau