Migrate repository cache from redis-cache to new shard on VMs
[Repository calls account for 40–50% of commands](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/857#note_1197165845) issued to redis-cache, and this will be the first workload migrated off redis-cache onto its own shard as part of the Redis functional partitioning effort. Because of the workload's size and concerns about future CPU growth while we work out a redis-cluster strategy, we will migrate it to a VM/Sentinel-based Redis infrastructure, similar to what redis-cache uses today.
We did a subset of this work previously under https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/762; however, that effort targeted Kubernetes rather than VMs. To preserve that history for reference, I'm opening a new epic for this work.
It has been some time since we built out an instance on VMs, so I used https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1246 as a guideline.
## Conclusions
Using Tamland to track, week over week, how much time remained before saturation worked well. We received the first alert for Redis CPU saturation on 2023-01-31, and since we were already planning the rollout of redis-repository-cache, we moved it up to that evening.
This functional partitioning effort provided a significant gain: an average 30% drop in CPU utilization on redis-cache. A week-over-week comparison of primary CPU utilization is shown below.

[source](https://thanos-query.ops.gitlab.net/graph?g0.expr=avg_over_time(gitlab_component_ops%3Arate_5m%7Bcomponent%3D%22primary_server%22%2Cenv%3D%22gprd%22%2Cenvironment%3D%22gprd%22%2Cmonitor%3D%22global%22%2Ctype%3D%22redis-cache%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=max_over_time(gitlab_component_saturation%3Aratio%7Benv%3D%22gprd%22%2Cenvironment%3D%22gprd%22%2Ctype%3D%22redis-cache%22%2Ccomponent%3D%22redis_primary_cpu%22%7D%5B1h%5D)%20&g1.tab=0&g1.stacked=0&g1.range_input=6h&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D&g2.expr=label_replace(avg_over_time(gitlab_component_saturation%3Aratio%7Benv%3D%22gprd%22%2Cenvironment%3D%22gprd%22%2Ctype%3D%22redis-cache%22%2Ccomponent%3D%22redis_primary_cpu%22%7D%5B2h%5D)%2C%20%27time%27%2C%20%27now%27%2C%20%27%27%2C%20%27%27)%0Aor%0Alabel_replace(avg_over_time(gitlab_component_saturation%3Aratio%7Benv%3D%22gprd%22%2Cenvironment%3D%22gprd%22%2Ctype%3D%22redis-cache%22%2Ccomponent%3D%22redis_primary_cpu%22%7D%5B2h%5D%20offset%201w)%2C%20%22time%22%2C%20%27offset-1w%27%2C%20%27%27%2C%20%27%27)&g2.tab=0&g2.stacked=0&g2.range_input=1w&g2.max_source_resolution=0s&g2.deduplicate=1&g2.partial_response=0&g2.store_matches=%5B%5D)
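For readability, the week-over-week comparison query encoded in the link above (decoded from the URL) is:

```
label_replace(
  avg_over_time(gitlab_component_saturation:ratio{env="gprd", environment="gprd",
    type="redis-cache", component="redis_primary_cpu"}[2h]),
  "time", "now", "", "")
or
label_replace(
  avg_over_time(gitlab_component_saturation:ratio{env="gprd", environment="gprd",
    type="redis-cache", component="redis_primary_cpu"}[2h] offset 1w),
  "time", "offset-1w", "", "")
```

The `offset 1w` modifier shifts the second series back one week, and `label_replace` tags each series (`time="now"` vs `time="offset-1w"`) so both can be plotted on the same graph for a direct before/after comparison.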
It also allowed us to test the feature-flag-based MultiStore configuration that we will use for future migrations. [This prompted a discussion about whether or not the current behaviour is the best one.](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2161)
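The migration pattern behind that configuration can be sketched as follows. This is a minimal illustration of the general dual-write, flag-gated-read approach, not the actual `Gitlab::Redis::MultiStore` implementation; the class and method names here are hypothetical.

```ruby
# Illustrative sketch of a feature-flag-gated multi-store cache.
# Writes go to both stores so either can serve reads during the
# migration; reads prefer the new store when the flag is enabled,
# falling back to the old store on a miss.
class MultiStore
  def initialize(new_store, old_store, flag_enabled:)
    @new_store = new_store      # e.g. the redis-repository-cache shard
    @old_store = old_store      # e.g. the existing redis-cache shard
    @flag_enabled = flag_enabled
  end

  def get(key)
    return @old_store[key] unless @flag_enabled

    # Prefer the new store; fall back to the old one on a cache miss.
    @new_store[key] || @old_store[key]
  end

  def set(key, value)
    # Dual-write keeps both stores warm until the flag is fully rolled out.
    @new_store[key] = value
    @old_store[key] = value
  end
end

# Usage: hashes stand in for Redis connections in this sketch.
old_store = { "repo:1" => "cached" }
new_store = {}
store = MultiStore.new(new_store, old_store, flag_enabled: true)
store.get("repo:1")        # new store is empty, so this falls back
store.set("repo:2", "fresh") # lands in both stores
```

Once the flag is fully enabled and the new store is warm, the fallback and dual-write paths can be removed and traffic cut over entirely, which is what the [production rollout](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8309) completed.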
## Status 2023-02-08
[Production rollout occurred on 2023-01-31](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8309) with [one minor Praefect incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8337). redis-cache CPU utilization dropped from 95–97% at peak to 65%. redis-repository-cache CPU utilization is 50% at peak. This epic is now closed.