Add package metadata tables and models
What does this MR do and why?
Adds package metadata schema and models so that client code can retrieve license data for a project's sbom components.
Related issue: Update DB schema to store data imported from th... (#373163 - closed)
Sample data
Data was loaded from a sample maven registry license list. Sample data loaded from here (internal link).
SELECT i.relname "table",indexrelname "index",
pg_size_pretty(pg_total_relation_size(relid)) As "Total",
pg_size_pretty(pg_relation_size(relid)) as "Table Size",
pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
reltuples::bigint "Num rows estimate"
FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
WHERE i.relname like 'pm_%';
table | index | Total | Table Size | Index Size | Num rows estimate
-----------------------------+------------------------------------------------------------+---------+------------+------------+-------------------
pm_licenses | pm_licenses_pkey | 936 kB | 528 kB | 104 kB | 3446
pm_licenses | i_pm_licenses_on_spdx_identifier | 936 kB | 528 kB | 264 kB | 3446
pm_package_version_licenses | pm_package_version_licenses_pkey | 873 MB | 322 MB | 229 MB | 7630966
pm_package_version_licenses | index_pm_package_version_licenses_on_pm_package_version_id | 873 MB | 322 MB | 163 MB | 7630966
pm_package_version_licenses | index_pm_package_version_licenses_on_pm_license_id | 873 MB | 322 MB | 158 MB | 7630966
pm_package_versions | pm_package_versions_pkey | 1009 MB | 407 MB | 163 MB | 7634649
pm_package_versions | index_pm_package_versions_on_pm_package_id | 1009 MB | 407 MB | 178 MB | 7634649
pm_package_versions | i_pm_package_versions_on_package_id_and_version | 1009 MB | 407 MB | 261 MB | 7634649
pm_packages | pm_packages_pkey | 83 MB | 38 MB | 10008 kB | 455570
pm_packages | i_pm_packages_purl_type_and_name | 83 MB | 38 MB | 35 MB | 455570
(10 rows)
Example query
This schema is designed not to be joined to the main monolith tables as the plan is for it to be represented in its own gitlab_schema
and to prevent cross joins.
Query plans
In order to simulate the query as above, license data is randomly sampled from the existing data above:
CREATE TABLE rand_package_versions AS SELECT ps.name, pvs.version FROM pm_package_versions pvs JOIN pm_packages ps ON pvs.pm_package_id = ps.id WHERE ps.purl_type = 1 ORDER BY random() LIMIT 1000;
The "client" (sampled) data is queried in the FROM
clause via IN
.
SELECT l.spdx_identifier
FROM pm_packages AS p
JOIN pm_package_versions AS pv
ON p.id = pv.pm_package_id
JOIN pm_package_version_licenses AS pvl
ON pvl.pm_package_version_id = pv.id
JOIN pm_licenses AS l
ON pvl.pm_license_id = l.id
WHERE p.purl_type = 1 AND (p.name, pv.version) IN (SELECT name, version FROM rand_package_versions);
Note that we are querying against 1000 packages (in rand_package_versions
) which is likely excessive. For example GitLab
has about 500 dependencies (direct + transitive) in its Gemfile.lock
.
The query plan is shown against the (non-partitioned) schema:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=170.82..5755.21 rows=2005 width=30) (actual time=0.866..15.935 rows=1000 loops=1)
Hash Cond: (pvl.pm_license_id = l.id)
-> Nested Loop (cost=27.29..5606.40 rows=2005 width=8) (actual time=0.245..15.199 rows=1000 loops=1)
-> Nested Loop (cost=26.86..4667.31 rows=2006 width=8) (actual time=0.237..10.683 rows=1000 loops=1)
-> Nested Loop (cost=26.42..2262.33 rows=990 width=16) (actual time=0.231..5.181 rows=1000 loops=1)
-> HashAggregate (cost=26.00..35.90 rows=990 width=49) (actual time=0.217..0.310 rows=1000 loops=1)
Group Key: rand_package_versions.name, rand_package_versions.version
-> Seq Scan on rand_package_versions (cost=0.00..21.00 rows=1000 width=49) (actual time=0.004..0.047 rows=1000 loops=1)
-> Index Scan using i_pm_packages_purl_type_and_name on pm_packages p (cost=0.42..2.25 rows=1 width=49) (actual time=0.005..0.005 rows=1 loops=1000)
Index Cond: ((purl_type = 1) AND (name = rand_package_versions.name))
-> Index Scan using i_pm_package_versions_on_package_id_and_version on pm_package_versions pv (cost=0.43..2.43 rows=1 width=24) (actual time=0.005..0.005 rows=1 loops=1000)
Index Cond: ((pm_package_id = p.id) AND (version = rand_package_versions.version))
-> Index Scan using index_pm_package_version_licenses_on_pm_package_version_id on pm_package_version_licenses pvl (cost=0.43..0.46 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=1000)
Index Cond: (pm_package_version_id = pv.id)
-> Hash (cost=100.46..100.46 rows=3446 width=38) (actual time=0.610..0.610 rows=3446 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 280kB
-> Seq Scan on pm_licenses l (cost=0.00..100.46 rows=3446 width=38) (actual time=0.003..0.352 rows=3446 loops=1)
Planning Time: 1.609 ms
Execution Time: 16.017 ms
(19 rows)
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.