Update repository details API to expose the size of grouped image repositories
Context
In #493 (closed) we added the ability to calculate the deduplicated size of individual repositories.
For context, this is the query we currently use for individual repositories:
Query
SELECT
coalesce(sum(q.size), 0)
FROM ( WITH RECURSIVE cte AS (
SELECT
m.id AS manifest_id
FROM
manifests AS m
WHERE
m.top_level_namespace_id = $1
AND m.repository_id = $2
AND EXISTS (
SELECT
FROM
tags AS t
WHERE
t.top_level_namespace_id = m.top_level_namespace_id
AND t.repository_id = m.repository_id
AND t.manifest_id = m.id)
UNION
SELECT
mr.child_id AS manifest_id
FROM
manifest_references AS mr
JOIN cte ON mr.parent_id = cte.manifest_id
WHERE
mr.top_level_namespace_id = $1
AND mr.repository_id = $2)
SELECT DISTINCT ON (l.digest)
l.size
FROM
layers AS l
JOIN cte ON l.top_level_namespace_id = $1
AND l.repository_id = $2
AND l.manifest_id = cte.manifest_id) AS q;
Now we need to extend that to support the calculation of the deduplicated size of grouped repositories. For example, if I have two repositories a/b/c
and a/b/d
, we should be able to tell what is the deduplicated size of a
and a/b
.
Proposal
The same rationale explained in #493 (closed) applies to this. The difference is that instead of limiting the calculation to the sum of all unique layers within a single repository, here we need to expand that across multiple repositories.
Deduplicated size of the top-level namespace
Here we just need to exclude the AND m.repository_id = $2
filter from the previous query, and then all calculations will happen at the top-level namespace level:
SELECT
coalesce(sum(q.size), 0)
FROM ( WITH RECURSIVE cte AS (
SELECT
m.id AS manifest_id
FROM
manifests AS m
WHERE
m.top_level_namespace_id = $1
AND EXISTS (
SELECT
FROM
tags AS t
WHERE
t.top_level_namespace_id = m.top_level_namespace_id
AND t.manifest_id = m.id)
UNION
SELECT
mr.child_id AS manifest_id
FROM
manifest_references AS mr
JOIN cte ON mr.parent_id = cte.manifest_id
WHERE
mr.top_level_namespace_id = $1)
SELECT DISTINCT ON (l.digest)
l.size
FROM
layers AS l
JOIN cte ON l.top_level_namespace_id = $1
AND l.manifest_id = cte.manifest_id) AS q;
Query plan: https://explain.depesz.com/s/yPS2. This was extracted from the production database using the gitlab-org
top-level namespace as the argument (1193 repositories at the time of writing).
Deduplicated size of intermediate groups of repositories
Considering we want to know the deduplicated size of a/b
, here we need the size of all repositories under a/b/*
;
SELECT
coalesce(sum(q.size), 0)
FROM ( WITH RECURSIVE -- the standard requires putting RECURSIVE here if any cte is recursive
repository_ids AS MATERIALIZED (
-- materialized since we want to load the ids once then reuse them
SELECT
id
FROM
repositories
WHERE
top_level_namespace_id = $1
AND path LIKE $2
),
cte AS (
SELECT
m.id AS manifest_id
FROM
manifests AS m
WHERE
m.top_level_namespace_id = $1
AND m.repository_id IN (
SELECT
id
FROM
repository_ids)
AND EXISTS (
SELECT
FROM
tags AS t
WHERE
t.top_level_namespace_id = m.top_level_namespace_id
AND t.repository_id = m.repository_id
AND t.manifest_id = m.id)
UNION
SELECT
mr.child_id AS manifest_id
FROM
manifest_references AS mr
JOIN cte ON mr.parent_id = cte.manifest_id
WHERE
mr.top_level_namespace_id = $1
AND mr.repository_id IN (
SELECT
id
FROM
repository_ids))
SELECT DISTINCT ON (l.digest)
l.size
FROM
layers AS l
JOIN cte ON l.top_level_namespace_id = $1
AND l.repository_id IN (
SELECT
id
FROM
repository_ids)
AND l.manifest_id = cte.manifest_id) AS q;
Query plan: https://explain.depesz.com/s/LoLS. This was extracted from the production database for all repositories under gitlab-org/gitlab-services/*
(57 repositories at the time of writing).
To make this perform better we need an index on (top_level_namespace_id, path, id)
for repositories
. See #519 (comment 832423779) for more details.