Update package metadata license lookup to use compressed data
What does this MR do and why?
This MR is the first of two, and has been split this way to reduce the size of each MR:
- This MR
-
adds a new
compressed_package_metadata_queryfeature flag.This MR stubs
compressed_package_metadata_queryto false in all the tests. -
updates
PackageLicenses#fetchto use compressed data if thecompressed_package_metadata_queryfeature flag is enabled.
-
-
Test both sides of compressed_package_metadata_... (!120207 - merged)
- Tests both sides of the
compressed_package_metadata_queryfeature flag.
- Tests both sides of the
Background
When PackageLicenses#fetch was first implemented, it performed queries that relied on storing every single license for every package version. This was relatively straightforward to implement, and it made sense at the time since we assumed that all the data would take up about 5GB, however, it turns out that the complete license dataset takes up 26GB.
In an effort to reduce the memory requirements, we've come up with a structure which effectively "compresses" the license information by only recording versions where the licenses change. Packages might have a large number of versions, but most only use a single license, so this allows us to reduce the required storage dramatically, to about 1GB.
The existing PackageLicenses#fetch (aka the "uncompressed" fetcher) executes queries against the following tables:
pm_packagespm_package_versionspm_package_version_licensespm_licenses
The new "compressed" fetcher only needs to query the following tables:
-
pm_packagesThe
licensesjsonb column in this table contains all the necessary license information, in a compressed structure. This allows us to avoid needing to join other tables, and reduces data footprint considerably. -
pm_licensesThis allows us to look up the license identifier for each entry in the
licensescolumn fetched from thepm_packagestable.This table contains less than
500entries and is read into memory, allowing us to avoid separate queries for every license.
How does PackageLicenses#fetch work?
The Gitlab::LicenseScanning::PackageLicenses#fetch method accepts the output of the following two methods:
Gitlab::LicenseScanning::PipelineComponents.fetchGitlab::LicenseScanning::BranchComponents.fetch
These methods were added by Add classes for fetching SBOM components (!105994 - merged) and output an array of components, each of which have the following fields:
namepurl_typeversion
We then pass this array of components into the Gitlab::LicenseScanning::PackageLicenses.fetch which will return the corresponding licenses by querying pm_packages and pm_licenses tables which was added by Update DB schema to store data imported from th... (#373163 - closed).
The Gitlab::LicenseScanning::PipelineComponents.fetch method is required by the new license scanning implementation that is a part of &9400 (closed). This new approach relies on SBoM components instead of a license scanning report and will allow us to remove the need for a separate license scanning job.
Terms and concepts referenced in this MR
-
componentA
Hashie::Mashobject (which will most likely be changed to a PORO in the future) that come from the SBOM report for the pipeline. This SBOM is generated by the GitLab gemnasium analyzer.The conversion of SBOM report to components takes place in Gitlab::LicenseScanning::PipelineComponents#fetch.
Each
componentis a dependency of a project and consists of the following fields:namepurl_typeversion
-
pm_packagesThe
pm_packagestable contains every known package, across every package manager (maven,gem,golang,nuget,npm, etc) and there's currently over5,319,434entries in thepm_packagestable.The fields of interest in the
pm_packagestable are:-
purl_type: an integer representing the purl_type of thepackage. See Enums::Sbom for the complete list of supportpurl_types. -
name: the name of the package, for examplemyapi-fastapiorgithub.com/ocaml/ocamlbuild. -
licenses: ajsonbcolumn containing a serialized list of license information such as:-
default_licenses- an array of license ids in thepm_licensestable which are returned when we couldn't determine a matching license -
lowest_version- the lowest version we support. -
highest_version- the highest version we support. -
other_versions- an array ofversionsand thelicense_idsthe correspond to.
-
-
Caches used in this MR
This MR adds two methods for caching data:
-
build_component_versions_cache
We fetch package details from the
pm_packagestable which only contains thepurl_typeandname. We need a way to know which versions were requested for eachpurl_typeandname, so we use a hash to store this data for faster lookups. -
There's only about
500licenses in thepm_licensestable, and the data doesn't change often, so we use a cache to avoid a separate sql query for every spdx identifier.
Both of these caches are used by the Gitlab::LicenseScanning::PackageLicenses#compressed_fetch method.
Benchmarks
Benchmark results of fetching N records using both compressed and uncompressed package metadata. The compressed package metadata fetcher (the first table) is being implemented in this MR.
Benchmark results with compressed package metadata
===================================================================
user system total real
100 records 0.017064 0.000331 0.017395 ( 0.019686)
200 records 0.026447 0.000283 0.026730 ( 0.031948)
300 records 0.032604 0.000376 0.032980 ( 0.041379)
400 records 0.039828 0.000596 0.040424 ( 0.050058)
500 records 0.049104 0.000743 0.049847 ( 0.060847)
600 records 0.057560 0.000830 0.058390 ( 0.070677)
700 records 0.065156 0.001347 0.066503 ( 0.081667)
800 records 0.072402 0.000933 0.073335 ( 0.090494)
900 records 0.079092 0.001450 0.080542 ( 0.098282)
1000 records 0.087363 0.001284 0.088647 ( 0.110238)
2000 records 0.171223 0.002407 0.173630 ( 0.216396)
3000 records 0.249895 0.003460 0.253355 ( 0.317418)
4000 records 0.327978 0.004005 0.331983 ( 0.410994)
5000 records 0.447468 0.006151 0.453619 ( 0.550493)
===================================================================
Benchmark results with uncompressed package metadata
===================================================================
user system total real
100 records 0.007238 0.000080 0.007318 ( 0.010811)
200 records 0.012162 0.000075 0.012237 ( 0.019070)
300 records 0.019363 0.000179 0.019542 ( 0.031903)
400 records 0.024126 0.000355 0.024481 ( 0.039589)
500 records 0.029296 0.000270 0.029566 ( 0.047063)
600 records 0.035818 0.000891 0.036709 ( 0.058822)
700 records 0.041775 0.000384 0.042159 ( 0.065195)
800 records 0.046606 0.000716 0.047322 ( 0.072832)
900 records 0.052423 0.000537 0.052960 ( 0.083973)
1000 records 0.057706 0.000532 0.058238 ( 0.093362)
2000 records 0.117508 0.002094 0.119602 ( 0.203078)
3000 records 0.167786 0.001746 0.169532 ( 0.277484)
4000 records 0.225191 0.002783 0.227974 ( 0.374707)
5000 records 0.278470 0.002795 0.281265 ( 0.470225)
===================================================================
Data characteristics
Raw SQL
https://paste.depesz.com/s/CJF
SELECT
pm_packages.*
FROM
pm_packages
WHERE
(
pm_packages.purl_type,
pm_packages.name
) IN (
( 4, 'github.com/scop/powerline-go' ),
...
);
Query plan
5,000 random records out of 5,246,788
Planning Time: 23.864 ms
Execution Time: 107.283 ms
- https://explain.depesz.com/s/YsAw
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18789/commands/62401
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #408906 (closed)