Container Registry: Scaling limitations on top-level namespace usage calculation
Context
Related to Inconsistent container registry storage statistics (&9105).
Historical
The conversations around usage calculations started in https://gitlab.com/gitlab-org/container-registry/-/issues/317+.
Later, in Calculate deduplicated size of individual image... (#493 - closed), we implemented the current solution for individual repositories, and in Update repository details API to expose the siz... (#519 - closed) for groups of nested repositories (which applies to the top-level namespace). The descriptions of these two issues are extremely helpful for understanding how usage calculation works and the problem we're now facing.
Problem
After completing the GitLab.com migration, and more importantly, the migration of our largest customers (last step), we are now seeing an increasing number of failures when attempting to calculate the deduplicated size of nested repositories.
This issue focuses solely on the top-level namespace usage calculation. Project-level usage calculation suffers from the same problem (although at a much smaller scale) and will be dealt with separately (potentially in the same way) in Scaling limitations on project usage calculation (#822 - closed).
The priority is to address the top-level namespace usage calculation, as that is the final figure that enables customers to know their overall usage quota/limits.
After looking at a few occurrences, we have identified that the affected namespaces have several millions of unique layers across all their repositories. This causes the current method/query to time out. After this finding, it's clear that the current approach won't suit all namespaces and will only get worse as the data set grows (at least until/if usage decreases due to users cleaning up old data).
Impact
At the time of writing, this problem affects 0.95% of usage calculations (source).
Rationale/context behind current implementation
Measuring usage was not the driver behind the new registry metadata DB design/layout (online GC was). That necessity/requirement arrived later, during the GitLab.com migration. At that time, we had little visibility over the existing data size and distribution (beyond opaque tag counts), as the migration was still in its early days.
Back then, we wanted to pursue a solution that could offer:
1. Near realtime feedback on usage allocation - When a user deletes or adds data, that change should be reflected in the measured usage as soon as possible. This is the main reason why we decided to calculate usage by only accounting for tagged (directly or indirectly) layers and not simply all layers linked to repositories.
2. Zero impact/risk on the ongoing GitLab.com migration.
3. Fast and performant enough to cope with the GitLab.com scale, considering what we knew at the time.
So we ended up doing the best we could with what we knew and the tools we had, and a lot has changed since then - among which the visibility we gained over all existing data, including all the major users onboarded over the last few months. As a result, (3) turned out not to hold.
Possible solutions
A and B are relatively easy to implement, but are not a universal solution. Thus they are short-term mitigations intended to alleviate the problems we're seeing. They may complement each other.
My proposal is to move forward with A now and possibly B after. We should re-evaluate the impact once these are in place. If it's low enough, we may want to proceed with a "manual" short/mid-term approach where usage is calculated on demand (Customer Support?). This may be possible to achieve using a production clone with loose timeouts. Ideally, those namespaces are reduced in size so that this is no longer a problem. Alternatively, we need to wait for a definitive solution. We should also address the identified blockers/threats for C and D in case we have to act on them.
A (&9413)
- Type: Short-term mitigation
- Trade-off: No-delay/realtime feedback
- Blockers: Needs a new index, which needs to be applied with post-deployment migrations.
- Threats: None
- Scope: golang backend database rails frontend
Catch the 5s timeout when performing the current query (maximum precision and no usage update delay). When that happens, fall back to a simpler alternative query that does not take into account whether layers are referenced/tagged.
This will certainly still fail for a portion of the currently affected namespaces, but we need to quantify them. As a downside, not taking references into account means that the measured usage is not precise: if a user deletes hundreds of tags from their repositories, the namespace usage will only reflect the corresponding change after GC runs and wipes the images that became unreferenced (24h+ delay).
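A minimal sketch of this fallback flow, assuming the calculation happens in the registry's Go datastore layer and that both statements take the namespace ID as their single parameter (function and parameter names here are illustrative, not the registry's actual code):

```go
package datastore

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// namespaceUsage runs the precise (tagged, deduplicated) usage query with a 5s
// deadline. If it times out, it falls back to the cheaper estimate query and
// reports that the returned size is only an estimate.
func namespaceUsage(ctx context.Context, db *sql.DB, preciseQuery, estimateQuery string, namespaceID int64) (size int64, estimated bool, err error) {
	preciseCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	err = db.QueryRowContext(preciseCtx, preciseQuery, namespaceID).Scan(&size)
	switch {
	case err == nil:
		return size, false, nil
	case errors.Is(err, context.DeadlineExceeded):
		// Depending on the driver, a cancelled statement may instead surface as a
		// PostgreSQL "query_canceled" (57014) error; both cases mean "too slow".
		if err = db.QueryRowContext(ctx, estimateQuery, namespaceID).Scan(&size); err != nil {
			return 0, false, err
		}
		return size, true, nil
	default:
		return 0, false, err
	}
}
```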
If the alternative query succeeds, we should include a flag in the registry API response to let Rails know that the obtained usage is only an estimate. We should then surface this in the UI with e.g. a warning/banner beside the displayed registry usage. This is not great, but at least we're being clear/transparent about it.
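One possible shape for that flag, with hypothetical field names rather than the registry's actual API contract:

```go
// RepositoryDetails is a hypothetical response payload for the details
// endpoint; field names are illustrative only.
type RepositoryDetails struct {
	Path      string `json:"path"`
	SizeBytes int64  `json:"size_bytes"`
	// SizeIsEstimate tells Rails that the fallback query was used, i.e. the
	// figure ignores tag references and lags deletions until GC has run.
	SizeIsEstimate bool `json:"size_is_estimate"`
}
```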
As an efficiency improvement, to avoid unnecessarily repeating failing queries (a timeout when running the main and/or simplified query), we can flag the target namespace as being "too large" for a successful usage calculation on the registry. This flag can carry a 25h TTL (the 24h GC delay plus some slack) to avoid bursts of failures during that period.
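As a rough illustration of the flag mechanics (in practice this would more likely live in Redis or in a database column rather than in process memory):

```go
package datastore

import (
	"sync"
	"time"
)

// oversizedNamespaces remembers namespaces whose usage queries timed out so we
// can skip re-running them until the flag expires (25h: the 24h GC delay plus
// some slack). Hypothetical in-process helper, not existing registry code.
type oversizedNamespaces struct {
	mu      sync.Mutex
	expires map[int64]time.Time // namespace ID -> flag expiry
}

func newOversizedNamespaces() *oversizedNamespaces {
	return &oversizedNamespaces{expires: make(map[int64]time.Time)}
}

func (o *oversizedNamespaces) mark(namespaceID int64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.expires[namespaceID] = time.Now().Add(25 * time.Hour)
}

func (o *oversizedNamespaces) tooLarge(namespaceID int64) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	return time.Now().Before(o.expires[namespaceID])
}
```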
Note that this query requires a new index, which can only be applied with post-deployment migrations. While we support doing so, there is no automation for applying such migrations on .com, so we'll need a change request to apply them manually using the registry CLI.
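For illustration only, the kind of statement such a post-deployment migration could carry; the index name and columns below are assumptions, not the actual migration:

```go
// Hypothetical post-deployment migration statement; the real index definition
// and naming live in the registry's migration files.
const createNamespaceUsageIndexSQL = `
	CREATE INDEX IF NOT EXISTS index_layers_on_top_level_namespace_id_and_digest_and_size
		ON layers (top_level_namespace_id, digest, size)`
```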
B (&9414)
- Type: Short-term mitigation
- Trade-off: No-delay/realtime feedback + Accuracy
- Blockers: A
- Threats: None
- Scope: golang backend database rails frontend
If A doesn't yield good enough results, we can go one step further at the expense of accuracy: falling back from the main query to an even simpler one that does not deduplicate layers.
The measured usage will not take into account how many times a layer is reused across the images/repositories/projects/groups of a namespace, so precision is lost. We can either accept this, or apply an estimated/guessed deduplication factor.
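To make the difference concrete, a sketch of the two fallback shapes; the schema is simplified and the table/column names are assumptions, not the registry's real statements:

```go
const (
	// Option A fallback: still deduplicates, so each distinct layer counts once.
	deduplicatedSizeSQL = `
		SELECT coalesce(sum(size), 0)
		FROM (
			SELECT DISTINCT digest, size
			FROM layers
			WHERE top_level_namespace_id = $1
		) distinct_layers`

	// Option B fallback: plain sum, so a layer reused by N images counts N times.
	nonDeduplicatedSizeSQL = `
		SELECT coalesce(sum(size), 0)
		FROM layers
		WHERE top_level_namespace_id = $1`
)
```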
We could get "smart" about estimating the deduplication factor of a namespace by looking at a portion of its data. However, aside from the cost (development and runtime) of doing so, the accuracy of such samples would tend to be inversely proportional to the size of a namespace, so less relevant/accurate for the really large ones. Using a flat rate would also be inadequate IMO, as it's not tailored to each namespace's usage patterns.
So I think we'd be better off accepting the loss of accuracy. Regardless, as with A, we must communicate to Rails and users that the measured usage is not only delayed but also an estimate with no deduplication.
As with A, most likely we'll still see a few timeouts with this approach, but certainly way fewer.
C (#852)
- Type: Mid-term mitigation. Possible long-term solution.
- Trade-off: None
- Blockers: Assessing the effectiveness of A and/or B. Support for background migrations.
- Threats: Not a long-term solution in the presence of unbounded namespace growth
- Scope: golang backend database
Assign an internal ID to blobs and use that for the deduplication portion of the query instead of their digest. Based on the conversations above, this could represent a major cut in query cost, due to the number of bytes we currently need to fit in memory when deduplicating by digest.
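For illustration, the deduplication step would then group on an 8-byte `bigint` instead of the much wider digest value (hypothetical column names):

```go
// Hypothetical query shape for option C: deduplicate on the new blob_id column
// rather than on the digest bytes.
const deduplicatedSizeByBlobIDSQL = `
	SELECT coalesce(sum(size), 0)
	FROM (
		SELECT DISTINCT blob_id, size
		FROM layers
		WHERE top_level_namespace_id = $1
	) distinct_blobs`
```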
The `layers` table is already related to `blobs`, but blobs are univocally identified by `digest`; there is no ID for them. We'd have to (see the sketch after this list):
- Add an ID (`bigint`) column to the `blobs` table;
- Add a blob ID column to the `layers` table;
- Start filling the blob ID for all new layers;
- Backfill the blob ID for all existing layers.
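A rough sketch of those schema steps, assuming plain DDL plus a batched backfill; names, batch size, and the absence of partitioning details are all simplifications, not the real migrations (step 3, filling the column for new layers, would happen in the registry write path):

```go
const (
	// Steps 1 and 2: add the new columns.
	addBlobIDColumnSQL      = `ALTER TABLE blobs ADD COLUMN IF NOT EXISTS id bigint GENERATED BY DEFAULT AS IDENTITY`
	addLayerBlobIDColumnSQL = `ALTER TABLE layers ADD COLUMN IF NOT EXISTS blob_id bigint`

	// Step 4: backfill in small batches (1000 rows here) so we never hold long
	// transactions; this is the part that needs background-migration support.
	backfillLayerBlobIDBatchSQL = `
		UPDATE layers l
		SET blob_id = b.id
		FROM blobs b
		WHERE b.digest = l.digest
		  AND l.blob_id IS NULL
		  AND l.id IN (SELECT id FROM layers WHERE blob_id IS NULL LIMIT 1000)`
)
```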
We'd better not replace the blob digest with the blob ID as we rely on the former pretty much everywhere (API and GC queries, indexes, GC triggers), so that would be a huge amount of work. In this case, the blob ID would be used for usage calculation purposes only.
The lack of support for background migrations is the main blocker here; without it, we can't backfill at scale. This is an existing problem rather than a new one, though, and we need to address it regardless.
D (#844)
- Type: Long-term solution
- Trade-off: Simplicity
- Blockers: Assessing the effectiveness of A/B/C. Existence of a ClickHouse deployment for GitLab.com.
- Threats: Technical feasibility
- Scope: golang backend database
This would be the most drastic (and likely most effective) approach, in case all of the above are deemed insufficient. It would require replicating a portion of the registry data to ClickHouse and outsourcing the usage calculation to the latter.
The PostgreSQL and ClickHouse databases would be kept in sync with logical replication. The registry would connect to ClickHouse for usage calculation queries.
Some registry database changes for data normalization would be required prior to this. There is a separate issue to fully assess the technical feasibility of this option and its requirements (#844).