Integration tests for semver_dialects using license export v2
Problem to solve
It's not possible to upgrade the backend to semver_dialects v3 because it causes too many errors in License Scanning.
See !151761 (comment 1896877139)
We need to investigate to establish why semver_dialects v3 causes License Scanning errors, and how it needs to be adjusted.
Proposal
Add integration tests that validate the semver_dialects gem against the license data published on the Package Metadata DB (AKA License DB).
Use export v2 to ensure that License Scanning (implemented in the backend) can process the entire dataset using the gem.
- Check that lowest_version and highest_version can be parsed.
- Check that all versions that aren't listed in other_licenses can be parsed without errors, and that they're within the boundaries (i.e. greater than or equal to lowest_version and lower than or equal to highest_version).
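The two checks above can be sketched for a single export v2 record as follows. This is only an illustration under stated assumptions: Gem::Version is a stand-in for the semver_dialects parser (the real specs would call the gem instead), the helper names try_parse and check_record are hypothetical, and the record shape (other_licenses as a list of objects with a versions array) is assumed from the fields named in this issue.

```ruby
# Gem::Version ships with Ruby and is used here ONLY as a stand-in parser.
# The real tests would parse and compare versions with semver_dialects.

# Hypothetical helper: returns a comparable version, or nil if parsing fails.
def try_parse(raw)
  return nil unless raw.is_a?(String) && !raw.strip.empty?

  Gem::Version.new(raw)
rescue ArgumentError
  nil
end

# Hypothetical helper: returns a list of error messages for one record.
# `candidate_versions` are the raw version strings to validate.
def check_record(record, candidate_versions)
  errors = []

  low = try_parse(record["lowest_version"])
  high = try_parse(record["highest_version"])
  errors << "cannot parse lowest_version #{record['lowest_version'].inspect}" if low.nil?
  errors << "cannot parse highest_version #{record['highest_version'].inspect}" if high.nil?

  # Versions listed under other_licenses are excluded from the checks.
  # Assumed shape: [{ "licenses" => [...], "versions" => [...] }, ...]
  excluded = (record["other_licenses"] || []).flat_map { |entry| entry["versions"] || [] }

  (candidate_versions - excluded).each do |raw|
    version = try_parse(raw)
    if version.nil?
      errors << "cannot parse version #{raw.inspect}"
    elsif low && high && !version.between?(low, high)
      errors << "version #{raw} is outside [#{low}, #{high}]"
    end
  end

  errors
end
```

An empty result means the record passed both checks; the boundary check is skipped for a record whose boundaries cannot be parsed, since that failure is already reported.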
Using the license exporter
When a CLI flag is set, make the license exporter list versions that are otherwise omitted. The flag is disabled by default. To be implemented in https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L107
Developers run the license exporter locally (but connected to the production database) to generate the full dataset.
They might remove other_licenses and other fields they don't need to run the tests.
Using exports v1 & v2
Use export v1 to list package versions that License Scanning might have to scan. (We might not list all the versions referenced as project dependencies, but this is the best dataset we have without extracting project dependencies from the production database.)
Pseudo-code
- For each package type,
  - For each JSON object of export v2 of licenses (NDJSON files),
    - Extract corresponding package versions from export v1 (CSV files).
    - Filter out versions that are in other_licenses.
    - For all these versions,
      - Parse version using the gem.
      - Report any parsing error.
      - Keep version if no error.
    - For lowest_version and highest_version,
      - Create an interval.
      - Report parsing error if any.
    - For all versions not in other_licenses and without errors,
      - Use the gem to determine if the version is in range.
      - Report error if version not in interval.
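A minimal Ruby sketch of the pseudo-code for one package type, under several assumptions that are not confirmed by this issue: export v1 is read as headerless CSV rows of name, version, license; export v2 objects carry name, lowest_version, highest_version, and other_licenses (a list of objects with a versions array); and Gem::Version stands in for the semver_dialects parser and range check. The method names are hypothetical.

```ruby
require "csv"
require "json"

# Stand-in for the gem: Gem::Version is NOT semver_dialects. Swap this
# method (and the between? call below) for the gem's API in real specs.
def parse_version(raw)
  return nil unless raw.is_a?(String) && !raw.strip.empty?

  Gem::Version.new(raw)
rescue ArgumentError
  nil
end

# Assumed export v1 layout: headerless rows of name,version,license.
def versions_by_package(csv_text)
  index = Hash.new { |hash, name| hash[name] = [] }
  CSV.parse(csv_text) { |name, version, _license| index[name] << version }
  index.transform_values(&:uniq)
end

# Runs the checks for one package type and returns the error messages.
def check_export(ndjson_text, csv_text)
  candidates = versions_by_package(csv_text)
  errors = []

  ndjson_text.each_line do |line|
    next if line.strip.empty?

    record = JSON.parse(line)
    name = record["name"]
    excluded = (record["other_licenses"] || []).flat_map { |e| e["versions"] || [] }

    # Parse every remaining version; keep only those that parse.
    parsed = {}
    (candidates.fetch(name, []) - excluded).each do |raw|
      version = parse_version(raw)
      if version
        parsed[raw] = version
      else
        errors << "#{name}: cannot parse version #{raw.inspect}"
      end
    end

    # Build the interval from lowest_version and highest_version.
    low = parse_version(record["lowest_version"])
    high = parse_version(record["highest_version"])
    if low.nil? || high.nil?
      errors << "#{name}: cannot parse boundaries"
      next
    end

    # Report every kept version that falls outside the interval.
    parsed.each do |raw, version|
      errors << "#{name}: #{raw} not in [#{low}, #{high}]" unless version.between?(low, high)
    end
  end

  errors
end
```

The outer loop over package types would call check_export once per type with that type's NDJSON and CSV contents. For the full dataset, streaming the files instead of loading them into strings is probably preferable.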
This is similar to what's been done in Generation expectations for version matching us... (#435473 - closed) and Add expectations to semver_dialects based on di... (#438860 - closed) prior to Add version matching edge cases to the semver_d... (#386070 - closed).
CI job
Ideally these new integration tests run in a CI job of the semver_dialects project.
It might be expensive to fetch the entire dataset and to run tests against that dataset, so the job might be a manual one.
We might limit the tests to a short list of packages to speed up the tests, and to ensure they pass. For instance, we could limit the tests to the top N most popular packages. In that case the test job would always run. See #462854 (comment 1915259041)
The job might be allowed to fail because some failures might be legitimate errors.
Invalid versions should always be listed under other_licenses, but that might not always be the case.
- Versions might be considered as valid by the export even though they're not. See https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L89
- Versions might be removed because the set contains too many versions. See https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/dfc7e472f26bd28ebc8f1f4a8bb1a40b98d1cc7d/data/license_group.go#L124
Alternatively, developers edit the generated expectations to remove what's incorrect.
Related links
- semver_dialects v3 (gitlab-org/ruby/gems/semver_dialects!72 - merged)
- Upgrade to semver_dialects 3.0.0 – REVERTED (!151761 - merged)
- Revert "Merge branch 'upgrade-to-semver_dialect... (!152342 - merged)
- https://gitlab.com/gitlab-org/gitlab/-/blob/dc02de895b40d4150bea7b628e21047ebb674946/ee/app/models/package_metadata/package.rb#L89-115
- https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/#data-format
- Generation expectations for version matching us... (#435473 - closed)
- Add version matching edge cases to the semver_d... (#386070 - closed)
Implementation plan
- Generate a dataset that combines lowest_version, highest_version, and versions that are valid and have the default set of licenses. (These are normally omitted in v2 exports.)
- Add specs to test semver_dialects against that dataset.
  - All versions can be parsed.
  - versions are within boundaries.
- Investigate failures, and create issues if needed.
- Optional: Create curated test data, and add it to the repo.
- Optional: Add a CI job that tests the gem against the curated test data.