Partition package metadata tables

Problem to solve

The schema of package metadata has a boundary around the package registry type: purl_type. This allows an efficient partitioning of the database for package metadata for insert and select queries which is quite important as the dataset is expected to grow large.

Sample data stats

Using sample import data for the maven package registry loaded into schema introduced in Update DB schema to store data imported from th... (#373163 - closed), we can present some stats.

gitlabhq_development=# select count(*) from pm_packages where purl_type = 5;
 count  
--------
 455576
(1 row)

Per-package version data:

gitlabhq_development=# select count(*) from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id where purl_type = 5;
  count  
---------
 7631100
(1 row)

Per-package version license data:

gitlabhq_development=# select count(*) from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id join pm_package_version_licenses pvls on pvs.id = pvls.pm_package_version_id where purl_type = 5;
  count  
---------
 7631183
(1 row)

Package version stats:

gitlabhq_development=# select percentile_cont(0.75) within group (order by cnt) as percentile_75, percentile_cont(0.90) within group (order by cnt) as percentile_90, percentile_cont(0.95) within group (order by cnt) as percentile_95, percentile_cont(0.99) within group (order by cnt) as percentile_99 from (select name, count(*) as cnt from pm_packages ps inner join pm_package_versions pvs on ps.id = pvs.pm_package_id  where purl_type = 1 group by (purl_type,name)) as A;
 percentile_75 | percentile_90 | percentile_95 | percentile_99 
---------------+---------------+---------------+---------------
            13 |            36 |            64 |           203

For the table schema: see pm_ tables and relations in !102794 (merged) and https://gitlab.com/gitlab-org/gitlab/-/raw/373163-add-package-metadata/db/structure.sql specifically.

Table stats (there is fake data added for each purl_type so first create a clean maven-only table):

drop table if exists test_data_pm_packages_5;
create table test_data_pm_packages_5 (
    like pm_packages
    including defaults
    including constraints
    including indexes
);
insert into test_data_pm_packages_5 select * from pm_packages where purl_type = 5;
drop table if exists test_data_pm_package_versions_5;
create table test_data_pm_package_versions_5 (
    like pm_package_versions
    including defaults
    including constraints
    including indexes
);
insert into test_data_pm_package_versions_5 select pvs.* from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id where purl_type = 5;
drop table if exists test_data_pm_package_version_licenses_5;
create table test_data_pm_package_version_licenses_5 (
    like pm_package_version_licenses
    including defaults
    including constraints
    including indexes
);
insert into test_data_pm_package_version_licenses_5 select pvls.* from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id join pm_package_version_licenses pvls on pvls.pm_package_version_id = pvs.id where purl_type = 5;
SELECT i.relname "table",indexrelname "index",
 pg_size_pretty(pg_total_relation_size(relid)) As "Total",
 pg_size_pretty(pg_relation_size(relid)) as "Table Size",
 pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
 reltuples::bigint "Num rows estimate"
 FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid 
 WHERE i.relname like 'test_data_pm_%';


                  table                  |                              index                              |  Total  | Table Size | Index Size | Num rows estimate 
-----------------------------------------+-----------------------------------------------------------------+---------+------------+------------+-------------------
 test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_5_pkey                    | 991 MB  | 322 MB     | 298 MB     |                 0
 test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_pm_package_version_id_idx | 991 MB  | 322 MB     | 213 MB     |                 0
 test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_5_pm_license_id_idx       | 991 MB  | 322 MB     | 158 MB     |                 0
 test_data_pm_package_versions_5         | test_data_pm_package_versions_5_pkey                            | 1009 MB | 407 MB     | 164 MB     |           7631146
 test_data_pm_package_versions_5         | test_data_pm_package_versions_5_pm_package_id_idx               | 1009 MB | 407 MB     | 177 MB     |           7631146
 test_data_pm_package_versions_5         | test_data_pm_package_versions_5_pm_package_id_version_idx       | 1009 MB | 407 MB     | 261 MB     |           7631146
 test_data_pm_packages_5                 | test_data_pm_packages_5_pkey                                    | 80 MB   | 35 MB      | 16 MB      |            455576
 test_data_pm_packages_5                 | test_data_pm_packages_5_purl_type_name_idx                      | 80 MB   | 35 MB      | 28 MB      |            455576

Proposal

purl_type is a natural partition boundary with partitions created by list. 8 types are currently supported.

Several partitions are possible:

partition pm_packages on purl_type
- pros: neat partition
- cons: ignores larger tables
partition pm_packages and pm_package_versions on `purl_type
- pros: partitions one of the largest tables
- cons: there's still a large table (pm_package_version_licenses) left over
partition all 3 tables pm_package_version_licenses as above
- pros: considers the larger tables
- cons: in order to pass duplicate handling to the database (e.g. INSERT ... ON CONFLICT DO UPDATE) the usage of foreign key constraints becomes more complicated as some combination of primary key columns and (in certain cases) unique indexes must exist in the list partition (which currently only accepts one column).

Out of the 3 proposals, the last seems to make the most sense and is a scalable solution when partitioning on purl_type. The last major "gotcha" or complication (this is true for all 3 proposals) is that when a new purl_type is added, the schema must be updated to add a new partition over that type.

Implementation plan

Create list partitions on purl_type and denormalize by adding the purl_type column to pm_package_versions and pm_package_version_licenses so that these tables can be list partitioned as well.

update schema using migration guide in https://docs.gitlab.com/ee/development/database/table_partitioning.html
pm_packages
- remove primary key on id column
- add unique index for foreign keys and inserts with purl_type
- create list partition on purl_type
pm_package_versions
- denormalize to add purl_type to table
- add unique index for foreign keys and inserts with purl_type
- create list partition on purl_type
pm_package_version_licenses
- denormalize to add purl_type to table
- add unique index for foreign keys and inserts with purl_type
- create list partition on purl_type
add documentation to mention that partitions must be added for new package registry support in package metadata

Testing

add bulk insert test script to populate the changed tables (for repeatable results)
provide table stats
provide explain on the main anticipated query to fetch spdx_identifiers for a project's dependnecies

Ensure

postgresql can still handle duplicates through the on conflict do update clause
ensure that indexes utilized are still optimally configured with purl_type specified as constant

Edited Jan 12, 2023 by Igor Frenkel