Calculate deduplicated size of individual image repositories
Context
So far, we haven't been able to provide visibility over the size of container repositories. This is because our ability has been severely limited by the constraints of the storage backend where the metadata is saved.
This is now changing as we ship the new version of the registry, backed by a metadata database (&5523), which makes this calculation feasible.
Problem
Here is a visual explanation of how the deduplicated size of a single repository should be calculated. All images described here are in the same repository.
We'll start with a tagged manifest with two layers:
graph TD;
tA((Tag A))
lA([Layer A])
lB([Layer B])
%%lC(Layer C)
%%lD(Layer D)
%%lE(Layer E)
mA[Manifest A]
%%mB[Manifest B]
%%mC[Manifest C]
%%mlA[Manifest List A]
%%mlB[Manifest List B]
tA-->mA;
mA-->lA;
mA-->lB;
Because Manifest A is tagged, we need to account for the size of Layer A and Layer B when calculating the size of the repository. So at this point, the size of the repository is Layer A + Layer B.
Next, we add another tagged manifest, Manifest B, which uses the same two layers as Manifest A plus an additional one:
graph TD;
tB((Tag B))
lA([Layer A])
lB([Layer B])
lC([Layer C])
%%lD(Layer D)
%%lE(Layer E)
%%mA[Manifest A]
mB[Manifest B]
%%mC[Manifest C]
%%mlA[Manifest List A]
%%mlB[Manifest List B]
tB-->mB
mB-->lA;
mB-->lB;
mB-->lC;
Because multiple tagged manifests use Layer A and Layer B, we must only count them once, i.e., the size of the repository is not (Layer A x 2) + (Layer B x 2) + Layer C but instead Layer A + Layer B + Layer C.
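The counting rule above can be sketched in a few lines of Python. This is only an illustration: the layer sizes are made-up numbers, and the names `layer_sizes`, `tagged_manifests`, and `deduplicated_size` are hypothetical, not part of the registry.

```python
# Hypothetical layer sizes in bytes (illustrative values only).
layer_sizes = {"Layer A": 100, "Layer B": 200, "Layer C": 50}

# The two tagged manifests from the diagrams and the layers they reference.
tagged_manifests = {
    "Manifest A": ["Layer A", "Layer B"],
    "Manifest B": ["Layer A", "Layer B", "Layer C"],
}


def deduplicated_size(manifests, sizes):
    """Sum each distinct layer once, however many manifests share it."""
    unique_layers = set()
    for layers in manifests.values():
        unique_layers.update(layers)
    return sum(sizes[layer] for layer in unique_layers)


# Counting every reference separately double-counts Layer A and Layer B.
naive_size = sum(
    layer_sizes[layer]
    for layers in tagged_manifests.values()
    for layer in layers
)

print(naive_size)                                        # 650
print(deduplicated_size(tagged_manifests, layer_sizes))  # 350
```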
Next comes another manifest, but this time it is untagged:
graph TD;
lB(Layer B)
lD(Layer D)
mC[Manifest C]
mC-->lB;
mC-->lD;
Manifest C is untagged and is therefore eligible for garbage collection (unless it is tagged or referenced by a manifest list in the meantime). So, despite having a new layer that we haven't seen so far (Layer D), the size of this layer doesn't count toward the size of the repository.
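As a rough sketch of this rule, the snippet below starts from the tags rather than from the manifests, so an untagged manifest never contributes layers. The sizes and the names `tags`, `manifest_layers`, and `tagged_size` are assumptions made for illustration.

```python
# Hypothetical layer sizes in bytes (illustrative values only).
layer_sizes = {"Layer A": 100, "Layer B": 200, "Layer C": 50, "Layer D": 400}

# Every manifest stored in the repository, tagged or not.
manifest_layers = {
    "Manifest A": ["Layer A", "Layer B"],
    "Manifest B": ["Layer A", "Layer B", "Layer C"],
    "Manifest C": ["Layer B", "Layer D"],  # untagged
}

# Tags point at manifests; Manifest C has no tag.
tags = {"Tag A": "Manifest A", "Tag B": "Manifest B"}


def tagged_size(tags, manifest_layers, sizes):
    """Count only layers reachable from a tag, each layer once."""
    counted = set()
    for manifest in set(tags.values()):
        counted.update(manifest_layers[manifest])
    return sum(sizes[layer] for layer in counted)


print(tagged_size(tags, manifest_layers, layer_sizes))  # 350, Layer D is ignored
```

If Manifest C were tagged later, Layer D would start counting toward the total.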
Now on to manifest lists:
graph TD;
tB((Tag C))
lA([Layer A])
lB([Layer B])
lE([Layer E])
mA[Manifest A]
mD[Manifest D]
mlA[Manifest List A]
tB-->mlA;
mlA-->mA;
mA-->lA;
mA-->lB;
mlA-->mD;
mD-->lE;
Manifest List A is tagged and references Manifest A, whose layers (Layer A and Layer B) we have already accounted for, so we should ignore them now. But it also references Manifest D. This manifest is not tagged but is referenced by this tagged manifest list, so we must account for Layer E.
Therefore, the repository size is Layer A + Layer B + Layer C + Layer E.
Manifest lists can also reference other manifest lists, and there is no depth limit:
graph TD;
tD((Tag D))
lA([Layer A])
lB([Layer B])
lE([Layer E])
lF([Layer F])
lG([Layer G])
mA[Manifest A]
mD[Manifest D]
mF[Manifest E]
mlA[Manifest List A]
mlB[Manifest List B]
tD-->mlB;
mlB-->mlA;
mlA-->mA;
mA-->lA;
mA-->lB;
mlA-->mD;
mD-->lE;
mlB-->mF;
mF-->lF;
mF-->lG;
We have another tagged manifest list. It references Manifest List A, which we already accounted for in Tag C. So we should ignore it. But then there is also Manifest E, which we haven't seen before. We must account for the layers referenced by it.
Therefore, the final repository size is Layer A + Layer B + Layer C + Layer E + Layer F + Layer G.
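The complete walkthrough amounts to a graph traversal: start at every tag, follow manifest list references to any depth, and sum each reachable layer once. The sketch below models that in memory. The data structures, sizes, and the function name `reachable_size` are hypothetical, and an iterative traversal with a visited set is just one way to handle unbounded nesting.

```python
# Hypothetical layer sizes in bytes (illustrative values only).
layer_sizes = {
    "Layer A": 10, "Layer B": 20, "Layer C": 30, "Layer D": 40,
    "Layer E": 50, "Layer F": 60, "Layer G": 70,
}

# Layers referenced by each manifest.
manifest_layers = {
    "Manifest A": ["Layer A", "Layer B"],
    "Manifest B": ["Layer A", "Layer B", "Layer C"],
    "Manifest C": ["Layer B", "Layer D"],  # untagged and unreferenced
    "Manifest D": ["Layer E"],
    "Manifest E": ["Layer F", "Layer G"],
}

# Manifests (or other manifest lists) referenced by each manifest list.
list_references = {
    "Manifest List A": ["Manifest A", "Manifest D"],
    "Manifest List B": ["Manifest List A", "Manifest E"],
}

tags = {
    "Tag A": "Manifest A",
    "Tag B": "Manifest B",
    "Tag C": "Manifest List A",
    "Tag D": "Manifest List B",
}


def reachable_size(tags, list_references, manifest_layers, sizes):
    """Sum the distinct layers reachable from any tag, at any list depth."""
    pending = list(tags.values())
    seen_manifests = set()
    seen_layers = set()
    while pending:
        manifest = pending.pop()
        if manifest in seen_manifests:
            continue  # already visited through another tag or list
        seen_manifests.add(manifest)
        # Manifest lists contribute children; plain manifests contribute layers.
        pending.extend(list_references.get(manifest, []))
        seen_layers.update(manifest_layers.get(manifest, []))
    return sum(sizes[layer] for layer in seen_layers)


# Layers A, B, C, E, F and G are counted; Layer D is not.
print(reachable_size(tags, list_references, manifest_layers, layer_sizes))  # 240
```

The visited set keeps shared manifests from being walked twice and also guards against reference cycles, should they ever occur.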
Solution
Implement a database query and an application function that can be used to efficiently and accurately calculate the deduplicated size of a given repository, taking into account all constraints described above.
Please note that this issue is just for the database query and the corresponding method at the application level. We will then need a new API route to make it possible to execute this query for a given repository and obtain the size.
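One possible shape for such a query is a recursive common table expression. The sketch below runs against an in-memory SQLite database so that it is self-contained. The table and column names (`tags`, `manifest_references`, `manifest_layers`, `layers`) are invented for this example and do not reflect the registry's actual schema or database engine, so treat it as a starting point rather than the final query.

```python
import sqlite3

# Hypothetical, simplified schema; not the registry's real tables.
SCHEMA = """
CREATE TABLE layers (id TEXT PRIMARY KEY, size INTEGER NOT NULL);
CREATE TABLE manifest_layers (manifest_id TEXT, layer_id TEXT);
CREATE TABLE manifest_references (parent_id TEXT, child_id TEXT);
CREATE TABLE tags (repository_id INTEGER, name TEXT, manifest_id TEXT);
"""

SIZE_QUERY = """
WITH RECURSIVE reachable(manifest_id) AS (
    -- Start from every manifest or manifest list tagged in the repository.
    SELECT manifest_id FROM tags WHERE repository_id = ?
    UNION
    -- Follow manifest list references to any depth; UNION drops repeats.
    SELECT r.child_id
    FROM reachable
    JOIN manifest_references AS r ON r.parent_id = reachable.manifest_id
)
SELECT COALESCE(SUM(size), 0)
FROM (
    -- Each layer appears once, however many manifests share it.
    SELECT DISTINCT l.id, l.size
    FROM reachable
    JOIN manifest_layers AS ml ON ml.manifest_id = reachable.manifest_id
    JOIN layers AS l ON l.id = ml.layer_id
) AS unique_layers
"""


def query_repository_size(conn, repository_id):
    """Return the deduplicated size of one repository in bytes."""
    return conn.execute(SIZE_QUERY, (repository_id,)).fetchone()[0]


def build_example():
    """Load the example from this issue, with made-up layer sizes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO layers VALUES (?, ?)", [
        ("A", 1), ("B", 2), ("C", 4), ("D", 8), ("E", 16), ("F", 32), ("G", 64),
    ])
    conn.executemany("INSERT INTO manifest_layers VALUES (?, ?)", [
        ("mA", "A"), ("mA", "B"),
        ("mB", "A"), ("mB", "B"), ("mB", "C"),
        ("mC", "B"), ("mC", "D"),  # Manifest C is untagged
        ("mD", "E"),
        ("mE", "F"), ("mE", "G"),
    ])
    conn.executemany("INSERT INTO manifest_references VALUES (?, ?)", [
        ("mlA", "mA"), ("mlA", "mD"), ("mlB", "mlA"), ("mlB", "mE"),
    ])
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", [
        (1, "A", "mA"), (1, "B", "mB"), (1, "C", "mlA"), (1, "D", "mlB"),
    ])
    return conn


conn = build_example()
print(query_repository_size(conn, 1))  # 119 = A + B + C + E + F + G
```

Using `UNION` rather than `UNION ALL` in the recursive step removes manifests that were already visited, which keeps the traversal finite even when lists share children.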