Functional Partitioning for Repository Cache to mitigate Redis saturation forecasts
DRI: @stejacks-gitlab

The primary goal of this epic is to track the work needed to buy us time before migrating the workloads that are moving to Redis Cluster: specifically, partitioning `redis-repository-cache` away from `redis-cache`.

## Analysis

Looking at [TamLand](https://gitlab-com.gitlab.io/gl-infra/tamland/redis.html), the two Redis instances closest to saturation are redis-cache and plain redis (AKA redis-persistent, AKA redis-shared-state). As of early December, redis-cache is trending upwards and forecasting saturation, whereas redis-persistent is not.

Redis-cache:

![image](/uploads/fbea229468851991e8a51d550b21ab0c/image.png)

Redis-persistent:

![image](/uploads/1d876216caab2e5de54c60f500bb0640/image.png)

The highest-priority workload to deal with is redis-cache, which may require sharding off more than one subset of its data. Previous experience, such as https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/841, suggests that we get the biggest impact by partitioning off workloads that score high on both [request rate](https://thanos-query.ops.gitlab.net/graph?g0.expr=topk(10%2C%20sum(rate(redis_commands_total%7Benv%3D%22gprd%22%2Ctype%3D%22redis-cache%22%7D%5B10m%5D))%20by%20(fqdn%2Ctype%2Ccmd)%20and%20on(fqdn)%20redis_instance_info%7Brole%3D%22master%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) and [time spent in the request handler](https://thanos-query.ops.gitlab.net/graph?g0.expr=topk(10%2Csum(rate(redis_commands_duration_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D%22redis-cache%22%7D%5B10m%5D))%20by%20(fqdn%2Ctype%2Ccmd)%20and%20on(fqdn)%20redis_instance_info%7Brole%3D%22master%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D).
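For reference, these are the two PromQL queries behind the links above, decoded from the URLs: the top commands by request rate and by time spent in the request handler on the current redis-cache master.

```promql
# Top 10 commands by request rate on the redis-cache master
topk(10, sum(rate(redis_commands_total{env="gprd",type="redis-cache"}[10m])) by (fqdn, type, cmd)
  and on(fqdn) redis_instance_info{role="master"})

# Top 10 commands by time spent in the request handler
topk(10, sum(rate(redis_commands_duration_seconds_total{env="gprd",type="redis-cache"}[10m])) by (fqdn, type, cmd)
  and on(fqdn) redis_instance_info{role="master"})
```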
We have previously discussed [sharding off the repository cache](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/762), and as that presently accounts for 40-50% of the commands and keys in redis-cache, [we will be addressing it first](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/860). Because the new instance, taking roughly 50% of the current redis-cache workload, would start at nearly 50% CPU (we're hitting 90-95% now), the 20-30% potential Kubernetes overhead would put us at risk of saturation. For that reason, we're building this infrastructure out on VMs.

For redis-cache the most common command is GET, and that is also where the majority of time is spent. This is why https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1994 suggests moving the most common GET key prefix away from redis-cache. Moving this workload [also assists with the future move to Redis Cluster](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1992#note_1163520195).

For redis-persistent, we see that about 35% of requests are for DB load-balancing keys. Therefore scalability#2046 suggests moving those keys to a dedicated Redis instance.

## Status 2023-02-17

All cleanup tasks, including documentation, have been completed. We gained nearly 30% CPU headroom on redis-cache as part of this effort. [We split redis-repository-cache from redis-cache on 2023-01-31](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/860).
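The partitioning approach described above can be sketched as routing cache keys to a dedicated store based on a key prefix. This is a minimal illustrative sketch, not GitLab's actual implementation: the `PrefixRouter` class, the prefix string, and the store names are all hypothetical.

```ruby
# Sketch of prefix-based partitioning: repository-cache keys are routed to a
# dedicated Redis instance, while everything else stays on the shared cache.
# The prefix below is an assumption for illustration only.
class PrefixRouter
  REPOSITORY_PREFIX = 'cache:gitlab:repository'.freeze

  def initialize(default_store:, repository_store:)
    @default_store = default_store
    @repository_store = repository_store
  end

  # Pick the store for a given key: repository-cache keys go to the
  # partitioned instance, everything else to the default cache.
  def store_for(key)
    key.start_with?(REPOSITORY_PREFIX) ? @repository_store : @default_store
  end
end

# Usage with symbols standing in for real Redis connections:
router = PrefixRouter.new(default_store: :cache, repository_store: :repository_cache)
router.store_for('cache:gitlab:repository:branch_names:42') # => :repository_cache
router.store_for('cache:gitlab:show_path')                  # => :cache
```

In a real deployment the two stores would be separate Redis connections configured from separate hosts, so the split requires no application-level data migration: the new instance warms its cache from scratch while the old keys on redis-cache simply expire.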