Horizontally Scale redis-cache using Redis Cluster
DRI: @schin1

This epic migrates our existing `ratelimiting` instance to Redis Cluster, in preparation for the subsequent migration of `redis-cache` &878.

The goal of this epic is to solve the cache's chronic risk of CPU saturation as the GitLab.com SaaS workload continues to grow (from both traffic and features), providing a means to scale the Redis workload across an arbitrary number of CPUs.

Rather than using a single primary instance (with Sentinel for HA), the new cluster will shard the key storage among multiple primaries, splitting the CPU and memory footprint among them.

## Why implement Redis Cluster?

https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823#note_1194941678

## Approach

We will begin by migrating `redis-ratelimiting` https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823, which is much smaller, simpler to migrate, and more tolerant of failures. This will give us early feedback to guide the `redis-cache` migration.

### Definition of done

This epic is done when:

* `redis-cache` is running on Redis Cluster.
* All Redis nodes in the new cluster are below 70% CPU and memory utilization at their daily peak workload.
* All new failure modes we determine to be critical to address before production go-live have a well understood and documented recovery mechanism.
* Metrics, dashboards, alerts, and log ingestion have been added to provide observability for the new cluster.
* Profiling tools work with the cluster's Redis build.
* Tamland supports capacity forecasting for Redis Cluster. (This forecasting may be per node, per shard, or per cluster.)
* Runbook documentation provides an overview of Redis Cluster's architecture, observability, and troubleshooting.

### Related work streams

This epic prioritizes the critical path issues blocking our initial adoption of Redis Cluster. Our primary goal is to migrate `redis-cache` to use Redis Cluster.
For expedience, we are first migrating `redis-ratelimiting` https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823, which is much smaller, simpler to migrate, and more tolerant of failures. This will give us early feedback to guide the `redis-cache` migration. This will be the first practical use of the artifacts we are building to support a Redis Cluster deployment, and it will give us faster access to real-world data on resource utilization/overhead, capacity planning, error handling, fault detection and recovery behavior/tuning, and any observability gaps.

Concurrently, the complementary epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857 buys us time by slicing another functional partition off of `redis-cache`, reducing its CPU usage and isolating an important subset of its workload. (The new partition was initially going to be the feature flags cache, but [keyspace analysis suggests](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857#note_1197165845) that repository-cache may be more beneficial.)

Concurrently, another complementary epic https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/979 facilitates the [dual-write strategy](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2016) by sharding out feature-flag workloads into their own instance.

Initial outline of our approach: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/823#note_1194941678

### Milestones

These milestones are listed in roughly their anticipated completion order. Some will be in progress concurrently.

1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1992 & https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2320 - Identify, prioritize, and address critical new failure modes for `redis-cache`
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2073 - Implement migration support for `redis-cache`
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2074 - Deploy `redis-cache` cluster to nonprod
1. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2075 - Deploy `redis-cache` cluster to production

## Status 2023-08-22

The clean-up is complete and the `redis-cache` instance has been removed. To summarise the result of this epic:

- All cache-related workloads (rate-limiting, cache, repository-cache, chat, and feature-flag) for GitLab Rails are now Redis Cluster compatible.
- [`redis-cluster-cache`](https://dashboards.gitlab.net/d/redis-cluster-cache-main/redis-cluster-cache-overview?orgId=1) has replaced `redis-cache` for over a month, with plenty of CPU headroom.
- [GDK](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2032) and the [GitLab repo's CI pipeline](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2327) have been updated to improve the developer experience of working with Redis Cluster.
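
For background on how a Redis Cluster shards key storage among multiple primaries, as described in the introduction of this epic: the Redis Cluster specification maps every key to one of 16384 hash slots using a CRC16 checksum, and each primary owns a subset of those slots. The following is a minimal Python sketch of that mapping, not code from GitLab Rails or from this epic. The function names `hash_tag_or_key` and `key_slot` are hypothetical, and the sketch assumes that `binascii.crc_hqx` with an initial value of 0 produces the same CRC16 variant the specification describes.

```python
import binascii

# Redis Cluster divides the keyspace into a fixed number of hash slots.
SLOT_COUNT = 16384


def hash_tag_or_key(key: bytes) -> bytes:
    """Return the part of the key that is hashed.

    If the key contains "{...}" with at least one byte between the first
    "{" and the next "}", only that substring (the hash tag) is hashed.
    Otherwise the whole key is hashed.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: bytes) -> int:
    """Map a key to its hash slot: CRC16(key or hash tag) mod 16384."""
    # Assumption: crc_hqx(data, 0) matches the CRC16 used by Redis Cluster.
    return binascii.crc_hqx(hash_tag_or_key(key), 0) % SLOT_COUNT


if __name__ == "__main__":
    # Illustrative keys only; these are not real GitLab cache keys.
    for example in (b"foo", b"{user1000}.following", b"{user1000}.followers"):
        print(example.decode(), key_slot(example))
```

Keys that share a hash tag land in the same slot, which is what allows multi-key commands to keep working on a cluster. This is one reason workloads had to be made Redis Cluster compatible before migration: multi-key operations whose keys fall in different slots are rejected by the cluster.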