Horizontally Scale redis-cache using Redis Cluster
DRI: @schin1
This epic migrates our existing `ratelimiting` instance to Redis Cluster, in preparation for then migrating `redis-cache` (&878). The goal of this epic is to solve the cache's chronic risk of CPU saturation as the GitLab.com SaaS workload continues to grow (from both traffic and features), by providing a means to scale the Redis workload across an arbitrary number of CPUs. Rather than using a single primary instance (with Sentinel for HA), the new cluster will shard the key storage among multiple primaries, splitting the CPU and memory footprint among them.
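To make the sharding concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, modulo 16384, with `{...}` hash tags restricting which part of the key is hashed). The `key_slot` function follows the published Redis Cluster key distribution rules. The `primary_for_slot` helper and its even, contiguous split are hypothetical and for illustration only: real clusters assign slot ranges to primaries explicitly, and this is not how our cluster is configured.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: bytes) -> int:
    """Return the Redis Cluster hash slot for a key.

    If the key contains a non-empty hash tag ("{...}"), only the text
    between the first "{" and the next "}" is hashed, so related keys
    can be forced onto the same slot.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16/XMODEM variant
    # (polynomial 0x1021) that Redis Cluster specifies.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def primary_for_slot(slot: int, num_primaries: int) -> int:
    """Hypothetical helper: split slots evenly into contiguous ranges."""
    return slot * num_primaries // HASH_SLOTS


if __name__ == "__main__":
    for key in (b"foo", b"{user1}:profile", b"{user1}:settings"):
        slot = key_slot(key)
        print(key.decode(), slot, primary_for_slot(slot, 3))
```

Because each key deterministically lands on one slot, adding primaries spreads both the keyspace and the CPU cost of serving it. The trade-off is that multi-key commands only work when all keys involved hash to the same slot, which is what hash tags are for.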
## Why implement Redis Cluster?
https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823#note_1194941678
## Approach
We will begin by migrating `redis-ratelimiting` https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823, which is much smaller, simpler to migrate, and more tolerant of failures. This will give us early feedback to guide the `redis-cache` migration.
### Definition of done
This epic is done when:
* `redis-cache` is running on Redis Cluster.
* All Redis nodes in the new cluster are below 70% CPU and memory utilization at their daily peak workload.
* All new failure modes we determine to be critical to address before production go-live have a well understood and documented recovery mechanism.
* Metrics, dashboards, alerts, and log ingestion have been added to provide observability for the new cluster.
* Profiling tools work with the cluster's Redis build.
* Tamland supports capacity forecasting for Redis Cluster. (This forecasting may be per node, per shard, or per cluster.)
* Runbook documentation provides an overview of Redis Cluster's architecture, observability, and troubleshooting.
### Related work streams
This epic prioritizes the critical path issues blocking our initial adoption of Redis Cluster.
Our primary goal is to migrate `redis-cache` to use Redis Cluster.
For expedience, we are first migrating `redis-ratelimiting` https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823, which is much smaller, simpler to migrate, and more tolerant of failures. This will give us early feedback to guide the `redis-cache` migration. This will be the first practical use of the artifacts we are building to support a Redis Cluster deployment, and it will give us faster access to real-world data on resource utilization/overhead, capacity planning, error handling, fault detection and recovery behavior/tuning, and any observability gaps.
Concurrently, the complementary epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857 buys us time by slicing another functional partition off of `redis-cache`, reducing its CPU usage and isolating an important subset of its workload. (The new partition was initially going to be the feature flags cache, but [keyspace analysis suggests](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857#note_1197165845) that repository-cache may be more beneficial.)
Concurrently, another complementary epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/979 facilitates the [dual-write strategy](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2016) by sharding out feature-flag workloads into their own instance.
Initial outline of our approach: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823#note_1194941678
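For readers unfamiliar with the term, the following is a generic sketch of what a dual-write migration usually looks like: every write goes to both the old and the new store, reads are served from one of them, and a switch flips reads over once the new store is warm. This is an assumption about the general pattern, not a description of the strategy in the linked issue. The `DualWriteStore` class and its `read_from_new` flag are hypothetical names, and plain dictionaries stand in for the two Redis deployments.

```python
class DualWriteStore:
    """Hypothetical wrapper: write to both stores, read from one."""

    def __init__(self, old_store: dict, new_store: dict, read_from_new: bool = False):
        self.old_store = old_store
        self.new_store = new_store
        self.read_from_new = read_from_new  # flipped once the new store is trusted

    def set(self, key, value):
        # Writes go to both stores so that either can serve reads.
        self.old_store[key] = value
        self.new_store[key] = value

    def get(self, key):
        # Reads come from whichever store is currently the default.
        store = self.new_store if self.read_from_new else self.old_store
        return store.get(key)

    def delete(self, key):
        # Deletions must also reach both stores to avoid stale reads.
        self.old_store.pop(key, None)
        self.new_store.pop(key, None)


if __name__ == "__main__":
    old, new = {"existing": "1"}, {}
    store = DualWriteStore(old, new)
    store.set("session", "abc")
    store.read_from_new = True
    print(store.get("session"))   # written after dual-write began
    print(store.get("existing"))  # written before, so missing in the new store
```

For a cache, a key that exists only in the old store simply shows up as a miss after the switch and is repopulated on demand, which is part of why cache workloads tolerate this kind of migration well.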
### Milestones
These milestones are listed in roughly their anticipated completion order. Some will be in progress concurrently.
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1992 & https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2320 - Identify, prioritize, and address critical new failure modes for `redis-cache`
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2073 - Implement migration support for `redis-cache`
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2074 - Deploy `redis-cache` cluster to nonprod
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2075 - Deploy `redis-cache` cluster to production
## Status 2023-08-22
The clean-up is complete and the `redis-cache` instance has been removed.
To summarise the result of this epic:
- All cache-related workloads (rate-limiting, cache, repository-cache, chat, and feature-flag) for GitLab Rails are now Redis Cluster compatible.
- `redis-cache` has been replaced by [`redis-cluster-cache`](https://dashboards.gitlab.net/d/redis-cluster-cache-main/redis-cluster-cache-overview?orgId=1), which has been running for over a month with plenty of CPU headroom.
- [GDK](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2032) and the [GitLab repo's CI pipeline](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2327) have been updated to improve the developer experience of working with Redis Cluster.