# Functional Partitioning for Repository Cache to mitigate Redis saturation forecasts
DRI: @stejacks-gitlab
The primary goal of this epic is to track the work needed to buy time ahead of migrating workloads to Redis Cluster: specifically, partitioning `redis-repository-cache` away from `redis-cache`.
## Analysis
Looking at [TamLand](https://gitlab-com.gitlab.io/gl-infra/tamland/redis.html), the two Redis instances closest to saturation are `redis-cache` and plain Redis (also known as `redis-persistent` or `redis-shared-state`).
As of early December, `redis-cache` is trending upward and forecast to saturate, whereas `redis-persistent` is not.
Redis-cache:

Redis-persistent:

The highest-priority workload to deal with is `redis-cache`, and it may require partitioning off more than one subset of data.
Previous experience, such as https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/841, suggests that we get the biggest impact by partitioning off workloads that score high both on [request rate](https://thanos-query.ops.gitlab.net/graph?g0.expr=topk(10%2C%20sum(rate(redis_commands_total%7Benv%3D%22gprd%22%2Ctype%3D%22redis-cache%22%7D%5B10m%5D))%20by%20(fqdn%2Ctype%2Ccmd)%20and%20on(fqdn)%20redis_instance_info%7Brole%3D%22master%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) and [time spent in the request handler](https://thanos-query.ops.gitlab.net/graph?g0.expr=topk(10%2Csum(rate(redis_commands_duration_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D%22redis-cache%22%7D%5B10m%5D))%20by%20(fqdn%2Ctype%2Ccmd)%20and%20on(fqdn)%20redis_instance_info%7Brole%3D%22master%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D).
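The topk queries above rank commands by how often they run and how much handler time they consume. The same scoring can be sketched offline against parsed `INFO COMMANDSTATS` output; this is a minimal illustration, and the `sample` numbers are made up rather than taken from production:

```python
# Rank Redis commands by total handler time, mirroring the topk Prometheus
# queries. Input mimics parsed INFO COMMANDSTATS output:
# {command: {"calls": total call count, "usec": total microseconds spent}}.

def top_commands(commandstats, n=3):
    """Return the n commands that consumed the most total time (usec),
    highest first, as (command, calls, usec) tuples."""
    ranked = sorted(commandstats.items(), key=lambda kv: kv[1]["usec"], reverse=True)
    return [(cmd, s["calls"], s["usec"]) for cmd, s in ranked[:n]]

# Made-up sample standing in for real commandstats from a cache node.
sample = {
    "get": {"calls": 9_000_000, "usec": 45_000_000},
    "set": {"calls": 1_200_000, "usec": 9_600_000},
    "expire": {"calls": 800_000, "usec": 1_600_000},
    "mget": {"calls": 300_000, "usec": 12_000_000},
}

for cmd, calls, usec in top_commands(sample):
    print(f"{cmd}: {calls} calls, {usec / 1e6:.1f}s total")
```

A workload that ranks near the top on both call count and total time, as GET does here, is the kind of candidate worth partitioning off.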
We have previously discussed [sharding off the repository cache](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/762), and as that presently accounts for 40 - 50% of the commands and keys in redis-cache, [we will be addressing it first](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/860). Because the new instance would start at nearly 50% CPU (half of the current redis-cache load, which is hitting 90 - 95%), a potential 20 - 30% Kubernetes overhead would put us back at risk of saturation. For that reason, we're building this infrastructure out on VMs.
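The 40 - 50% figure comes from measuring what share of keys carries the repository-cache prefix. A rough way to estimate that is to sample keys (e.g. via `SCAN` or `RANDOMKEY`) and bucket them by prefix; a minimal sketch, assuming colon-delimited prefixes and using a made-up key sample:

```python
from collections import Counter

def prefix_shares(keys):
    """Bucket keys by their first colon-delimited segment and return each
    prefix's share of the sample as a fraction in [0, 1]."""
    counts = Counter(k.split(":", 1)[0] for k in keys)
    total = len(keys)
    return {prefix: n / total for prefix, n in counts.items()}

# Made-up sample standing in for keys gathered with SCAN/RANDOMKEY;
# these are illustrative prefixes, not GitLab's actual key layout.
sample_keys = [
    "repository:tree:abc",
    "repository:blob:def",
    "repository:commit:ghi",
    "session:user:1",
]
print(prefix_shares(sample_keys))  # repository keys dominate this sample
```

Sampling like this is approximate, but for a prefix holding 40 - 50% of keys even a modest sample converges quickly on the right order of magnitude.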
For redis-cache the most common command is GET and that is also where the majority of time is spent. This is why https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1994 suggests moving the most common GET key prefix away from redis-cache. Moving this workload [also assists with the future move to redis-cluster](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1992#note_1163520195).
For redis-persistent we see that about 35% of requests are for DB load balancing keys. Therefore scalability#2046 suggests moving those keys to a dedicated Redis instance.
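Both moves amount to functional partitioning on the client side: the application keeps one logical interface but routes each key to the Redis instance that owns its workload. A minimal sketch of that routing, using dict-backed stand-ins for real Redis clients (the prefixes and store names are illustrative, not GitLab's actual configuration):

```python
class PartitionedRedis:
    """Route commands to a dedicated Redis instance based on key prefix,
    falling back to the default instance for everything else."""

    def __init__(self, default_store, routes):
        self.default_store = default_store
        self.routes = routes  # {key prefix: store handling that workload}

    def _store_for(self, key):
        for prefix, store in self.routes.items():
            if key.startswith(prefix):
                return store
        return self.default_store

    def get(self, key):
        return self._store_for(key).get(key)

    def set(self, key, value):
        self._store_for(key).set(key, value)

# Dict-backed stand-in for a real Redis client, for illustration only.
class FakeStore(dict):
    def set(self, key, value):
        self[key] = value

cache, repo_cache = FakeStore(), FakeStore()
client = PartitionedRedis(cache, {"repository:": repo_cache})
client.set("repository:tree:abc", "tree")  # lands on the dedicated instance
client.set("session:1", "user")            # lands on the default instance
```

Because the split is by prefix rather than by hash slot, each partition can later be sized, tuned, or migrated to Redis Cluster independently.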
## Status 2023-02-17
[We split redis-repository-cache from redis-cache on 2023-01-31](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/860), gaining nearly 30% CPU headroom on redis-cache. All cleanup tasks, including documentation, have been completed.