
Determining if a repository is old or new causes significantly increased latency

Context

As described in #374 (closed), during Phase 1 of the GitLab.com container registry upgrade/migration, for every incoming API request, we have to determine if the target repository (present in the request URL) is old or new.

Being old means that the repository exists under the old bucket prefix and is not registered in the metadata database. Being new means that it doesn't exist under the old bucket prefix, and therefore it either already exists in the database and under the new bucket prefix, or it should be created there.

This operation is the step Exists under old bucket prefix? in the following diagram:

graph TD
    A([Check target repository]) --> B{Exists under old bucket prefix?}
    B -- "Yes (existing repository)" --> C([Process with old code])
    C -- "Using old prefix" --> GCS[(Storage backend)]
    B -- "No (new repository)" --> D{JWT token has migration flag?}
    D -- "Yes" --> E{Flag set to `true`?}
    E -- "Yes" --> F([Process with new code])
    F --> DB[(Metadata DB)]
    F -- "Using new prefix" --> GCS
    E -- "No" --> C;
    D -- "No" --> G{Exists under new bucket prefix? *}
    G -- "Yes" --> F
    G -- "No" --> C

* This check is what allows us to pause the migration (if needed to debug a problem) by not adding any more new repositories to the database, while still being able to continue serving requests for those already there.
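
Expressed as code, the routing decision above looks roughly like the following hedged sketch. All helper and parameter names are hypothetical; the actual registry implementation differs.

```go
import "context"

// existsFn stands in for a check against one of the two bucket prefixes.
type existsFn func(ctx context.Context, repoPath string) (bool, error)

// useNewCodePath reports whether a request should be served by the new,
// database-backed code path. Hedged sketch of the diagram above only.
func useNewCodePath(ctx context.Context, repoPath string, migrationFlag *bool,
	existsUnderOldPrefix, existsUnderNewPrefix existsFn) (bool, error) {

	onOld, err := existsUnderOldPrefix(ctx, repoPath) // the expensive GCS check discussed below
	if err != nil {
		return false, err
	}
	if onOld {
		return false, nil // existing repository: keep the old code path and old prefix
	}
	if migrationFlag != nil {
		return *migrationFlag, nil // the JWT migration flag, when present, decides
	}
	// No flag in the token: fall back to the new-prefix check. This is what
	// allows pausing the migration while still serving repositories that are
	// already on the metadata database.
	return existsUnderNewPrefix(ctx, repoPath)
}
```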

For Phase 1 we decided to check the old bucket prefix instead of the database, because that way we don't have to touch or depend on the database to serve requests for existing repositories during Phase 1. If a target repository exists under the old bucket prefix, we simply process it the old way. Therefore, in case of a database-related incident, requests for existing repositories are not affected.

Related to gitlab-com/gl-infra/production#5824 (closed).

Problem

After enabling the database and migration mode in Canary, we noticed a significant apdex degradation that was clearly related to slow Stat operations against GCS:

[Chart: apdex degradation in Canary correlated with slow GCS Stat operations]

For the Exists under old bucket prefix? check we are using an internal operation called Stat. This operation is already heavily used to serve API requests, so the latency degradation was intriguing: we were expecting to see more Stat operations (which is fine), not slower ones. We also didn't spot this problem while testing in Staging.

One thing that is different here is that, unlike the majority of use cases for this Stat operation, we're not checking whether an object exists but whether a prefix exists. This led me to debug the code and find the reason for the slowdown.

The source code for the GCS Stat operation can be found here. This method wasn't changed for this project and is considered generic enough for most use cases. There we can see that this is a compound operation:

  1. First we try to check if a given path exists as an object (here), using a GCS object metadata get request;
  2. If it doesn't, we then try to see if the path exists as a prefix/directory (here), using a GCS list request.
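
For illustration, this compound behavior looks roughly like the sketch below, written against the official cloud.google.com/go/storage client. This is not the registry's actual driver code, and the function name is made up.

```go
import (
	"context"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// statLike mimics the two-step behavior described above (simplified sketch).
func statLike(ctx context.Context, bkt *storage.BucketHandle, path string) (bool, error) {
	// (1) Try the path as an object: a single metadata GET request against GCS.
	_, err := bkt.Object(path).Attrs(ctx)
	if err == nil {
		return true, nil
	}
	if err != storage.ErrObjectNotExist {
		return false, err
	}

	// (2) Fall back to treating the path as a prefix, using a list request.
	// Note that no Delimiter is set, so GCS matches every object under the
	// prefix at any depth; this is the recursive listing discussed below.
	it := bkt.Objects(ctx, &storage.Query{Prefix: path + "/"})
	_, err = it.Next()
	if err == iterator.Done {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```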

So for this specific use case, it's unnecessary to do (1), because we know that the path we're looking for is either a prefix/directory or doesn't exist at all. This means we're doing two requests against GCS when we only need one.

The second problem, and definitely the worst, is that the GCS list request used in (2) does not currently include any delimiter! This means that when checking if a prefix such as docker/registry/v2/repositories/myrepository exists, we're actually recursively listing its contents. Although the list request is limited to a single page, it's completely unnecessary to recursively list the contents to an unbounded depth.

A third problem is that the list request in (2) is currently requesting the full metadata details for every single item found under the target path. This is also unnecessary.

Overall, this means that the Stat method for GCS has a long-standing bug. It has gone unnoticed so far because the main use case for this operation is to check if a given object exists, which means that either the object exists and we stop at (1), or it does not exist and (2) returns nothing (no performance penalty from a recursive listing).

Unlike in Staging, this problem became severe enough in production to trigger alarms, because most of the target repositories already exist there (the write rate in production is very low). Additionally, the request latency and response size of this unbounded list operation are proportional to the size/depth of the target repositories (in terms of object count, not storage space), which is expected to be far higher in production.

Solution

We should start by fixing the Stat operation bug. This will benefit this specific use case but also avoid the same problem for any other use cases.

We must use a delimiter of / for the list request within Stat, ensuring that we only list a single level (the immediate children of the base prefix), with no recursion.

Additionally, we must request as little metadata as possible for each item within the base path. This can be achieved by requesting a partial response from GCS; requesting the name attribute is enough.
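
Assuming the official cloud.google.com/go/storage client purely for illustration, the corrected list request could be built roughly as follows; the registry's own driver may express the same thing differently.

```go
import "cloud.google.com/go/storage"

// prefixQuery builds the corrected list request for the prefix check
// (illustrative sketch, not the actual driver code).
func prefixQuery(prefix string) (*storage.Query, error) {
	q := &storage.Query{
		Prefix:    prefix + "/", // e.g. "docker/registry/v2/repositories/myrepository/"
		Delimiter: "/",          // only the immediate children; no recursive descent
	}
	// Partial response: the object name is all we need for an existence check.
	// (When combining a delimiter with a field selection, verify that the
	// client version in use still returns the matching sub-prefixes.)
	if err := q.SetAttrSelection([]string{"Name"}); err != nil {
		return nil, err
	}
	return q, nil
}
```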

Then we have two options, specifically for this use case and project:

1. Create an optimized Stat operation

Unfortunately, GCS does not have an API operation to simply determine whether a prefix exists; we have to use a list request for this. Therefore, we can create a Stat operation optimized for checking prefixes and use that.

This operation would perform only a single network request against GCS: the list request in (2), implemented correctly and with performance in mind as described above.
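
A hedged sketch of what such an operation could look like, again using cloud.google.com/go/storage for illustration (the final method name and actual driver code may differ):

```go
import (
	"context"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// prefixExists reports whether anything lives under the given prefix, using a
// single delimited, name-only list request. Illustrative sketch only.
func prefixExists(ctx context.Context, bkt *storage.BucketHandle, prefix string) (bool, error) {
	q := &storage.Query{
		Prefix:    prefix + "/",
		Delimiter: "/", // one level only; no recursive listing
	}
	if err := q.SetAttrSelection([]string{"Name"}); err != nil {
		return false, err
	}
	_, err := bkt.Objects(ctx, q).Next()
	switch {
	case err == iterator.Done:
		return false, nil // nothing under the prefix: not an existing (old) repository
	case err != nil:
		return false, err
	default:
		return true, nil // at least one child object or sub-prefix exists
	}
}
```

Calling this once per request would replace the current pair of GCS round trips (a metadata GET plus an unbounded list) with a single bounded list request.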

2. Use the database to determine if a repository is old

This is the only alternative. It would require an "exists" query against the metadata database for every HTTP request, which would be much faster than a GCS list request. However, the downside is that database downtime would impact requests for existing repositories.
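
For comparison, option 2 boils down to a single indexed lookup per request, along the lines of the sketch below. The table and column names are assumptions for illustration, not the actual registry schema.

```go
import (
	"context"
	"database/sql"
)

// repositoryExists checks the metadata database instead of GCS. Illustrative
// only; the real schema and query differ.
func repositoryExists(ctx context.Context, db *sql.DB, path string) (bool, error) {
	var exists bool
	err := db.QueryRowContext(ctx,
		`SELECT EXISTS (SELECT 1 FROM repositories WHERE path = $1)`,
		path,
	).Scan(&exists)
	return exists, err
}
```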

Decision

Optimizing the Stat operation should drastically decrease the latency of the Exists under old bucket prefix? check. It will still represent one additional list request against GCS for every HTTP request, so there is a cost involved, but that cost may well be worth it to avoid the downside of the alternative.

We can start by optimizing the Stat operation and then measure the remaining performance penalty. It might be the case that we can live with a slight latency increase in exchange for the isolation/availability benefit of not depending on the database to serve every single request.

If the performance penalty is still not acceptable, the only option is to use the database to check if a repository is old. We will have to depend on the database to serve all requests after Phase 1 anyway, so this is mostly about delaying that dependency until a later date, or until we have no other option, since any critical problems (which could cause database downtime) are more likely to appear during the initial period of the rollout.
