Add package metadata tables and models (!102794) · Merge requests · GitLab.org / GitLab

Igor Frenkel requested to merge 373163-add-package-metadata into master Nov 02, 2022

What does this MR do and why?

Adds package metadata schema and models so that client code can retrieve license data for a project's sbom components.

Sample data

Data was loaded from a sample maven registry license list. Sample data loaded from here (internal link).

SELECT i.relname "table",indexrelname "index",
 pg_size_pretty(pg_total_relation_size(relid)) As "Total",
 pg_size_pretty(pg_relation_size(relid)) as "Table Size",
 pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
 reltuples::bigint "Num rows estimate"
 FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid 
 WHERE i.relname like 'pm_%';

               table         |                           index                            |  Total  | Table Size | Index Size | Num rows estimate 
-----------------------------+------------------------------------------------------------+---------+------------+------------+-------------------
 pm_licenses                 | pm_licenses_pkey                                           | 936 kB  | 528 kB     | 104 kB     |              3446
 pm_licenses                 | i_pm_licenses_on_spdx_identifier                           | 936 kB  | 528 kB     | 264 kB     |              3446
 pm_package_version_licenses | pm_package_version_licenses_pkey                           | 873 MB  | 322 MB     | 229 MB     |           7630966
 pm_package_version_licenses | index_pm_package_version_licenses_on_pm_package_version_id | 873 MB  | 322 MB     | 163 MB     |           7630966
 pm_package_version_licenses | index_pm_package_version_licenses_on_pm_license_id         | 873 MB  | 322 MB     | 158 MB     |           7630966
 pm_package_versions         | pm_package_versions_pkey                                   | 1009 MB | 407 MB     | 163 MB     |           7634649
 pm_package_versions         | index_pm_package_versions_on_pm_package_id                 | 1009 MB | 407 MB     | 178 MB     |           7634649
 pm_package_versions         | i_pm_package_versions_on_package_id_and_version            | 1009 MB | 407 MB     | 261 MB     |           7634649
 pm_packages                 | pm_packages_pkey                                           | 83 MB   | 38 MB      | 10008 kB   |            455570
 pm_packages                 | i_pm_packages_purl_type_and_name                           | 83 MB   | 38 MB      | 35 MB      |            455570
(10 rows)

Example query

This schema is designed not to be joined to the main monolith tables as the plan is for it to be represented in its own gitlab_schema and to prevent cross joins.

Query plans

In order to simulate the query as above, license data is randomly sampled from the existing data above:

CREATE TABLE rand_package_versions AS SELECT ps.name, pvs.version FROM pm_package_versions pvs JOIN pm_packages ps ON pvs.pm_package_id = ps.id WHERE ps.purl_type = 1 ORDER BY random() LIMIT 1000;

The "client" (sampled) data is queried in the FROM clause via IN.

  SELECT l.spdx_identifier
    FROM pm_packages AS p
    JOIN pm_package_versions AS pv
      ON p.id = pv.pm_package_id
    JOIN pm_package_version_licenses AS pvl
      ON pvl.pm_package_version_id = pv.id
    JOIN pm_licenses AS l
      ON pvl.pm_license_id = l.id
    WHERE p.purl_type = 1 AND (p.name, pv.version) IN (SELECT name, version FROM rand_package_versions);

Note that we are querying against 1000 packages (in rand_package_versions) which is likely excessive. For example GitLab has about 500 dependencies (direct + transitive) in its Gemfile.lock.

The query plan is shown against the (non-partitioned) schema:

                                                                                                 QUERY PLAN                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=170.82..5755.21 rows=2005 width=30) (actual time=0.866..15.935 rows=1000 loops=1)
   Hash Cond: (pvl.pm_license_id = l.id)
   ->  Nested Loop  (cost=27.29..5606.40 rows=2005 width=8) (actual time=0.245..15.199 rows=1000 loops=1)
         ->  Nested Loop  (cost=26.86..4667.31 rows=2006 width=8) (actual time=0.237..10.683 rows=1000 loops=1)
               ->  Nested Loop  (cost=26.42..2262.33 rows=990 width=16) (actual time=0.231..5.181 rows=1000 loops=1)
                     ->  HashAggregate  (cost=26.00..35.90 rows=990 width=49) (actual time=0.217..0.310 rows=1000 loops=1)
                           Group Key: rand_package_versions.name, rand_package_versions.version
                           ->  Seq Scan on rand_package_versions  (cost=0.00..21.00 rows=1000 width=49) (actual time=0.004..0.047 rows=1000 loops=1)
                     ->  Index Scan using i_pm_packages_purl_type_and_name on pm_packages p  (cost=0.42..2.25 rows=1 width=49) (actual time=0.005..0.005 rows=1 loops=1000)
                           Index Cond: ((purl_type = 1) AND (name = rand_package_versions.name))
               ->  Index Scan using i_pm_package_versions_on_package_id_and_version on pm_package_versions pv  (cost=0.43..2.43 rows=1 width=24) (actual time=0.005..0.005 rows=1 loops=1000)
                     Index Cond: ((pm_package_id = p.id) AND (version = rand_package_versions.version))
         ->  Index Scan using index_pm_package_version_licenses_on_pm_package_version_id on pm_package_version_licenses pvl  (cost=0.43..0.46 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=1000)
               Index Cond: (pm_package_version_id = pv.id)
   ->  Hash  (cost=100.46..100.46 rows=3446 width=38) (actual time=0.610..0.610 rows=3446 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 280kB
         ->  Seq Scan on pm_licenses l  (cost=0.00..100.46 rows=3446 width=38) (actual time=0.003..0.352 rows=3446 loops=1)
 Planning Time: 1.609 ms
 Execution Time: 16.017 ms
(19 rows)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Nov 08, 2022 by Igor Frenkel

Add package metadata tables and models

What does this MR do and why?

Sample data

Example query

Query plans

MR acceptance checklist

Merge request reports