Cache tags in the GitLab.com container registry to improve the performance of the API
Context
Reported in Mechanical Sympathy: https://gitlab.slack.com/archives/CM5EQH125/p1619576575026200
The performance of GET /api/:version/groups/:id/registry/repositories is poor, with a p95 of over 15s.
Performance: p95: 15.244s max: 60.876s (49351 samples)
Problem
This group API endpoint lets users optionally request the tag list of every repository within the group and have it included in the response. These are the requests that cause problems. For each of them, Rails has to:
- Get a list of all container repositories within a group - query against the Rails DB;
- For each container repository, obtain the list of tags from the registry API - this means N network requests, one per repository, against a slow registry endpoint.
All of this happens in a single request. So, beyond the performance problem, I think we have an API design issue here.
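The cost of the steps above can be illustrated with a minimal sketch. Every name in it (Repository, FakeRegistryClient, group_repositories_with_tags) is a hypothetical stand-in rather than GitLab code; the fake client only counts the calls that would be HTTP requests to the registry.

```ruby
# Hypothetical stand-ins: none of these names come from the GitLab codebase.
Repository = Struct.new(:id, :path)

class FakeRegistryClient
  attr_reader :calls

  def initialize(tags_by_path)
    @tags_by_path = tags_by_path
    @calls = 0
  end

  # Stands in for one HTTP request to the registry's tag list endpoint.
  def tags(path)
    @calls += 1
    @tags_by_path.fetch(path, [])
  end
end

# The repositories array stands in for the single Rails DB query;
# the loop then issues one registry call per repository.
def group_repositories_with_tags(repositories, registry)
  repositories.map do |repo|
    { id: repo.id, path: repo.path, tags: registry.tags(repo.path) }
  end
end

repos = [
  Repository.new(1, "group/a"),
  Repository.new(2, "group/b"),
  Repository.new(3, "group/c")
]
registry = FakeRegistryClient.new("group/a" => ["v1", "v2"], "group/b" => ["latest"])
group_repositories_with_tags(repos, registry)
puts registry.calls # one registry request per repository
```

The number of registry calls grows linearly with the number of repositories in the group, and all of them sit on the critical path of one API response.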
Proposal
Cache the tag list on the Rails side. Using GitLab.com as an example, the registry has a very low write rate, which means the tag list changes infrequently for most repositories. Caching it on the Rails side is therefore feasible, provided that we can invalidate the cache properly.
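As a rough illustration of the idea, the sketch below shows a read-through cache for tag lists. It is not the actual implementation: TagListCache and its key format are hypothetical, and a plain Hash stands in for the real cache store.

```ruby
# Hypothetical read-through cache; a Hash stands in for the real store.
class TagListCache
  def initialize(store = {})
    @store = store
  end

  # Returns the cached tag list, or runs the block (the slow registry call)
  # once and stores its result. An empty list is cached too.
  def fetch(repository_path)
    key = cache_key(repository_path)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end

  # Drops the entry so the next fetch goes back to the registry.
  def invalidate(repository_path)
    @store.delete(cache_key(repository_path))
  end

  private

  # Assumed key format, for illustration only.
  def cache_key(repository_path)
    "container_repository:#{repository_path}:tags"
  end
end

registry_calls = 0
cache = TagListCache.new

# Two reads, but only the first one reaches the "registry".
2.times { cache.fetch("group/a") { registry_calls += 1; ["v1"] } }

# After invalidation the next read fetches the fresh list.
cache.invalidate("group/a")
cache.fetch("group/a") { registry_calls += 1; ["v1", "v2"] }
```

With a low write rate, most reads are served from the cache, and the registry is contacted only for the first read after an invalidation.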
The Package team had a sync to conduct this investigation (https://youtu.be/utrr52arzxQ), and came up with the following conclusions:
- The Docker client requests a JWT token before every pull/push, regardless of whether the target repository is public or private.
- Given 1) the self-managed install constraints (inability to accurately/reliably identify the type of container registry and the fact that we can't rely on JWT tokens if they are using a third-party registry) and 2) the fact that our topmost priority is to stabilize this endpoint's performance for GitLab.com: caching will be restricted to GitLab.com, at least for now.
- This feature will be placed behind a feature flag scoped by Group. Only a few groups (#329637 (comment 634053259)) use this API endpoint with the tags/tags_count options set to true (which leads to this performance problem). To minimize impact while still remediating the performance concern, we should restrict this feature to those groups. We can expand the list as necessary.
- We can later extend this feature to other/all groups (with a percentage-based rollout) if it works as expected and that is desirable, or we can simply wait for the definitive solution (API re-architecture/breaking change) and sunset this change. The latter is likely the best option, but this remains a two-way door decision.
- There will be a cache entry with the list of tag names for each (eligible) container repository. The cache will live in Redis. No key expiry is necessary, as we will be invalidating entries manually (see below).
- The cache will be invalidated in the authentication service whenever Rails serves a JWT token with push or delete permissions. We may end up invalidating the cache more often than we need to, but this is the best we can do (assume that a tag will always be added or deleted). Note: This must happen both for external (requested through the API) and internal (used by e.g., cleanup policies) token requests. See #327311 (comment 675863703) for details about cleanup policies.
- To avoid impacting other use cases (other API endpoints, cleanup policies, UI, Geo replication, etc.), using the cache will be restricted to this API endpoint. This means that we can't manipulate the cache in the low-level registry client class, as that would apply to all registry interactions.
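The invalidation and scoping rules in the list above could look roughly like the sketch below. All names (AuthService, GroupRepositoriesEndpoint, WRITE_ACTIONS) are hypothetical, a Hash stands in for Redis, a lambda stands in for the registry client, and a set of group IDs stands in for the group-scoped feature flag. Invalidating regardless of the flag is an assumption of this sketch, not something the conclusions specify.

```ruby
require "set"

# Hypothetical names throughout; a Hash stands in for Redis.
WRITE_ACTIONS = %w[push delete].freeze

class AuthService
  def initialize(cache)
    @cache = cache
  end

  # Runs for external (API) and internal (e.g. cleanup policy) token requests alike.
  def issue_token(repository_path, actions)
    # Assume a push/delete token always changes the tag list, so drop the entry.
    @cache.delete(repository_path) if (actions & WRITE_ACTIONS).any?
    { repository: repository_path, actions: actions } # placeholder, not a real JWT
  end
end

class GroupRepositoriesEndpoint
  def initialize(cache, registry, enabled_group_ids)
    @cache = cache
    @registry = registry # lambda standing in for the registry client
    @enabled_group_ids = enabled_group_ids.to_set
  end

  # Only this endpoint reads the cache, and only for groups with the flag enabled.
  def tags_for(group_id, repository_path)
    return @registry.call(repository_path) unless @enabled_group_ids.include?(group_id)

    unless @cache.key?(repository_path)
      @cache[repository_path] = @registry.call(repository_path)
    end
    @cache[repository_path]
  end
end
```

Because the cache lookup lives in the endpoint class rather than in the registry client, every other caller of the registry keeps talking to it directly. A pull-only token leaves the entry in place, while any token that includes push or delete removes it, even if no tag ends up changing.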
Further details
Ideally, this should be done using two separate endpoints, one for the repositories list and another for the tags list for each repository. There is an issue planned for this (#336912 (closed)), but this represents a breaking change, so we will not be able to address it before %15.0.