Discussion: Redis instance/cluster for Container Registry

Context

The registry supports Redis (including Sentinel) as one of the two caching backend options. However, we don't use Redis for GitLab.com or self-managed registries.

Besides caching, which is already a problem we have to solve (see Caching below for more details), we have already identified other possible use cases for Redis. This issue is meant to list all possible use cases and to discuss providing the registry with access to a Redis instance/cluster, both for GitLab.com and self-managed, making sure we have a plan before any of these use cases becomes a hard requirement.

Note: Although the metadata database could be used to implement a solution for some of these use cases, it's generally a good idea not to put all our eggs in one basket. Additionally, there are plenty of algorithms/libraries built around Redis to support most of these use cases, so we can avoid reinventing the wheel (and the risk of doing it wrong).

Use cases

Caching

For GitLab.com, we use the per-instance in-memory cache backend (source). This is neither reliable nor efficient (data is duplicated across all instances, wasting memory) and has already caused production incidents (gitlab-com/gl-infra/delivery#1022 (closed), https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/7). We've considered using Redis as the cache backend in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10868).

Although the upcoming metadata database will ease the concerns around caching object storage data, it would be good to reduce the database load and decrease the impact of a database outage (read redundancy) by using Redis as the cache backend.

This would be an immediate benefit, as the registry already supports Redis as a cache backend. As a result, we could reduce OOM incidents while having a pre-warmed, shared cache as deployments scale. Memory usage would drop significantly across the fleet and remain relatively stable.
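For reference, switching the existing blob-descriptor cache from `inmemory` to Redis in a distribution-style registry configuration looks roughly like the fragment below. The address and pool sizes are illustrative placeholders, not our production values:

```yaml
redis:
  addr: redis.internal:6379   # illustrative address, not a real endpoint
  db: 0
  dialtimeout: 10ms
  readtimeout: 10ms
  writetimeout: 10ms
  pool:
    maxidle: 16
    maxactive: 64
    idletimeout: 300s
storage:
  cache:
    blobdescriptor: redis     # instead of the per-instance "inmemory" backend
```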

Update: We disabled the inmemory cache on GitLab.com with success (gitlab-com/gl-infra/delivery#1878 (closed)). It will remain that way for the duration of the registry upgrade/migration. We'll start Phase 1 with no caching and review before Phase 2. If the load during Phase 1 goes above the desired threshold, we can pause the migration (stop adding new repositories to the database) and revisit this conversation.

Reliable webhook notifications

The registry webhook notifications, as-is, are not reliable. They offer no delivery (or ordering) guarantees because they rely on volatile per-instance in-memory queues.

These notifications are currently used only for GitLab.com, and only for metrics. However, we recently identified other possible use cases, such as notifying Rails whenever a repository migration completes. This is a possibility for Phase 2 of the gradual migration plan (#374 (closed)).

These additional use cases require notifications to be reliable. We could offer robust delivery guarantees by using Redis as the queue/broker for these notifications.

Per-repository read-only mode

As described in the gradual migration plan (#374 (closed)), we'll need to lock repositories against writes while they are being migrated (Phase 2). One suggestion was to use Redis for these distributed locks.

Database load balancing

The registry supports multi-host connections and custom target session attributes. However, it does not yet support active load-balancing between primary (read-write) and secondary (read-only) nodes.

Having a place that is not the metadata database (which is subject to replication lag) to keep track of the timestamp of the latest successful write for each repository would allow us to actively distribute requests across primary and secondary nodes.

Limit the number of actors in a cluster

It would be useful to control how many instances of the registry in a cluster can perform a given action. For example, currently, online GC is an active background process in each registry instance. Although we interleave these with randomized jitters and ensure the idempotency and synchronization of operations across instances, for clusters with a vast number of instances and/or a low write rate, it is unnecessary to have an active worker per instance.

We could limit the number of active actors in a cluster by using distributed locks based on Redis.

Rate limiting

As described in the metadata blueprint, right now, the registry has no rate limiting, but we might have to consider it in the future as an additional layer of protection.

Redis could be used to implement distributed rate limiting. For instance, KAS already uses Redis for this purpose.

Background migrations

If we ever need to support background database migrations, as Rails does, we'll need a system to enqueue and process background jobs. Redis can be used for this purpose.

What's the relation with the registry migration?

None of the highlighted use cases here are immediate requirements for the GitLab.com registry upgrade/migration. However:

  • Due to the in-memory cache leak, RAM on the registry pods is constantly being exhausted (see this). It might be difficult to spot a memory leak related to the new code (metadata database and online GC), as it would easily be masked by the current cache exhaustion. Additionally, the current exhaustion causes pods to be restarted often, which increases the chances of errors;

  • There are at least two use cases that may become a requirement for the migration Phase 2, namely reliable webhook notifications and per-repository read-only mode;

  • Ideally, we want to have database load balancing across primary and secondary nodes. While we might get by without it during most or all of Phase 1, it will most likely become necessary once we start handling a large volume of requests with the database, especially requests from VIP customers with high HA expectations.

Having access to Redis would ensure we can fix the current cache issue and be able to tackle any of these new use cases once they become a requirement.

To avoid adding one more variable to the registry upgrade/migration equation, I think we would be better off doing this at one of two points in time:

  • Before starting Phase 1: This would be "just" to address the cache problem. Doing this change before starting the registry upgrade/migration would mean that we don't mix both subjects, avoiding running into problems with both at the same time.

  • Between Phase 1 and Phase 2: There will be a small stabilization period between the two phases. By then, most of the issues with the new registry should already have been identified and fixed, so there is less risk of running into problems with the upgrade and the Redis integration at the same time. The downside is that we may need to tackle some use cases before then to handle the load increase during Phase 1, namely database load balancing.

The purpose and timing of this issue are to ensure that we keep these problems and possible requirements in mind and have a plan to act on them.

Edited by João Pereira