Eliminate memory saturation in redis-cache instances
The redis-cache instances have been having chronic latency issues due to key eviction when they reach their `maxmemory` limit. [This latency impacts the error budgets for stage groups.](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_1017758435) [The first attempt to resolve this issue was to upgrade Redis from 6.0 to 6.2.](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1567) When that upgrade did not resolve the issue, [an investigation issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601) was created, which led to the short- and medium-term fixes in this epic.

This epic will eliminate the memory saturation in the redis-cache instances; the outcome should be that we re-enable the [`redis_memory` saturation point for Redis cache in the runbooks](https://gitlab.com/gitlab-com/runbooks/blob/055bb6dc13d9afc67cf66a598949f5e9dd92e23c/libsonnet/saturation-monitoring/redis_memory.libsonnet#L73).

There are currently two sub-epics, covering the short-term and medium-term fixes.

### Follow-up items

1. [x] Locate items blocked on the `maxmemory` problem and notify their owners that the problem is resolved. For example: https://gitlab.com/gitlab-org/gitlab/-/issues/365575

### 2022-10-13 Status

Redis-cache memory utilization is now consistently below the saturation point.

A series of TTL reductions dropped Redis memory usage below its `maxmemory` limit, which resolved the recurring apdex dips. Eviction bursts no longer occur, so the Redis main thread no longer saturates its CPU evicting keys to get back under its memory limit. Instead, keys are expired gracefully (and cheaply), without causing latency spikes (a small headroom/expiry check is sketched after the graph below).

After driving memory demand a little below the saturation point (peaking at around 93% of `maxmemory`), we reprovisioned the VMs and raised `maxmemory` to add more headroom for future growth. This capacity increase (60 GB -> 120 GB) on 2022-08-26 ensured that the eviction-driven latency spikes will not return when future features or organic growth increase memory demand again.

However, the larger machines introduced a new challenge: multiple NUMA nodes. Tuning the kernel to disable foreground NUMA page migrations returned Redis latency to an acceptable level (see the tuning sketch after the graph below).

At this point, memory demand stays well below the saturation point, eliminating the latency spikes that were burning error budgets for the stage groups. The trend has remained stable and healthy over the last several weeks of observation.

The following graph shows redis-cache memory usage over the last 2 months, with annotations describing the milestones when the latency spikes ended and when the saturation margin became wide enough to be considered safe:

![Screenshot_from_2022-10-13_23-01-35](/uploads/e0e5e27d53ff0b9881016f68fbc07131/Screenshot_from_2022-10-13_23-01-35.png)

[source](https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?from=1659312000000&to=1664582400000&var-PROMETHEUS_DS=Global&var-environment=gprd&orgId=1&viewPanel=105)
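As an aside, here is a minimal sketch of the kind of check described above: how far memory usage sits from `maxmemory`, and whether key turnover is happening via cheap expiry rather than forced eviction. This is illustrative only; the real saturation point is defined by the `redis_memory` monitoring in the runbooks, and the host, port, and 90% warning threshold here are assumptions.

```python
# Illustrative sketch: report redis-cache memory saturation and whether keys
# are leaving via expiry (cheap) or eviction (main-thread CPU, latency spikes).
import redis


def report_memory_saturation(host: str = "localhost", port: int = 6379) -> None:
    r = redis.Redis(host=host, port=port)
    mem = r.info("memory")
    stats = r.info("stats")

    used = mem["used_memory"]
    maxmemory = mem["maxmemory"]
    if maxmemory == 0:
        print("maxmemory is unset; no eviction pressure possible")
        return

    saturation = used / maxmemory
    print(f"memory saturation: {saturation:.1%} of maxmemory")

    # Evictions mean the main thread is being forced to free memory;
    # expirations are the graceful path described in the status above.
    print(f"expired_keys={stats['expired_keys']} evicted_keys={stats['evicted_keys']}")

    if saturation > 0.90:  # assumed warning threshold, for illustration only
        print("WARNING: approaching maxmemory; eviction bursts may return")


if __name__ == "__main__":
    report_memory_saturation()
```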
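The NUMA tuning mentioned above could look roughly like the sketch below. The exact kernel parameter used on the redis-cache fleet is not documented in this epic; this assumes the standard `kernel.numa_balancing` sysctl, which controls automatic NUMA balancing and its page migrations. It must run as root on the Redis host.

```python
# Illustrative sketch, assuming kernel.numa_balancing is the relevant knob:
# disable automatic NUMA balancing so the kernel stops migrating pages
# between NUMA nodes while Redis is serving requests.
from pathlib import Path

NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def disable_numa_balancing() -> None:
    # Writing "0" disables automatic NUMA balancing (requires root).
    NUMA_BALANCING.write_text("0\n")


def numa_balancing_enabled() -> bool:
    return NUMA_BALANCING.read_text().strip() != "0"


if __name__ == "__main__":
    if numa_balancing_enabled():
        disable_numa_balancing()
    print(f"kernel.numa_balancing = {NUMA_BALANCING.read_text().strip()}")
```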