Investigate redis-cache latency spikes driven by eviction spikes
Quick reference
The blog post about this issue's research and impact gives a tidy narrative, highlighting the milestones, iterative approach, and analytical methods we used: https://about.gitlab.com/blog/2022/11/28/how-we-diagnosed-and-resolved-redis-latency-spikes/
Results of the changes we made based on these findings:
Topic of investigation
Why is redis-cache still experiencing abrupt bursts of evictions, even after upgrading to Redis 6.2?
We started this investigation over here: #1567 (comment 852705743)
But it seems like it warrants a dedicated issue to collect findings.
Background
Our redis-cache exhibits chronic latency spikes that correlate with bursts of key eviction. Upgrading to Redis 6.2 was expected to improve this, as 6.2 introduced a smoothing behavior to pace the evictions more gradually. Since the upgrade yesterday, the eviction counter shows that when evictions are needed, they now typically begin at a modest rate but still culminate in an abrupt large spike.
A typical spike in eviction rate after upgrading to Redis 6.2:
Why is it important to get to the root of the latency spikes?
Prior to the version upgrade, we saw bursts of traffic on Redis every minute which caused CPU saturation (reference).
In response to the CPU saturation, we increased the server-side idle timeout to reduce connection churn and upgraded instance types to C2 to gain more headroom. We also applied a rate limit to /api/v4/groups/:id/projects (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1906), looked at the N+1 queries on /api/v4/groups/:id/projects (gitlab-org/gitlab#213797), and fixed the caching issue (gitlab-org/gitlab#214510 (closed)).
It is important to understand these spikes because we cannot continue to add CPU headroom to accommodate them.
Conclusions
Root cause: What drives the eviction burst cycle? What ends it? Why does it free so much memory?
Short version:
The following feedback loop drives the performance regression that recurs roughly every 5-7 minutes during the daily peak workload:
- Redis memory usage reaches saturation. Evictions occur each time `maxmemory` is exceeded, freeing just enough memory to get back under the limit.
- Eviction overhead nudges the redis main thread's already high CPU utilization to its saturation point.
- The CPU saturation leads to significantly reduced redis throughput (e.g. 60K/s -> 5K/s). Request arrival rate exceeds response rate.
- Accumulating a request backlog increases memory pressure and drives more evictions, because a single memory pool handles both key storage and client-related scratch space.
- When enough clients are stalled that request arrival rate falls back below response rate, the eviction burst ends, and the backlog starts shrinking.
- As the request backlog shrinks, the memory that had been used as scratch space gets freed again. This is why hundreds of MB get steadily freed during the tens of seconds immediately after the end of an eviction burst.
- When the backlog has been fully processed, the total used memory is well below the `maxmemory` limit. Redis latency remains low until memory usage rises back up to the saturation point, which typically takes several minutes (and during which performance is good).
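The cycle above can be seen from the outside by watching a few INFO counters. The following is a minimal monitoring sketch, not tooling from this investigation; the host, port, and 5-second polling interval are placeholder assumptions.

```python
# Minimal sketch: poll INFO counters to watch the eviction burst cycle.
# Host, port, and the 5-second interval are placeholder assumptions.
import time

import redis

r = redis.Redis(host="redis-cache.example.internal", port=6379)

prev_evicted = None
while True:
    mem = r.info("memory")    # includes used_memory and maxmemory
    stats = r.info("stats")   # evicted_keys is a monotonically increasing counter

    evicted = stats["evicted_keys"]
    if prev_evicted is not None:
        # A sudden jump in this delta is an eviction burst; headroom near zero
        # means memory usage is at the saturation point described above.
        print(
            f"evictions/interval={evicted - prev_evicted} "
            f"used_memory={mem['used_memory']} "
            f"headroom={mem['maxmemory'] - mem['used_memory']}"
        )
    prev_evicted = evicted
    time.sleep(5)
```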
Long version:
The recurring pattern of eviction bursts causes periodic severe performance degradation for redis clients, manifesting as client-facing apdex dips and error rate spikes.
The eviction burst begins when redis memory usage reaches its configured limit (`maxmemory` = 60 GB). Once evictions start, their overhead pushes the already limited CPU capacity to its saturation point, which reduces throughput. That throughput reduction completes a feedback loop that sustains evictions for several seconds, briefly summarized below:
- When memory usage exceeds the configured `maxmemory` limit, the overhead of performing evictions effectively reduces client throughput.
- Due to that throughput reduction, the arrival rate of new requests now exceeds the response rate. Redis starts accumulating a backlog of client requests.
- Accumulating a backlog drives more evictions, because the same memory pool is used for both key storage and scratch space such as client buffers: Redis has to evict keys to make room for those buffers.
During this phase, arrival rate of new requests exceeds the response rate of completed requests. Request and response payloads compete for space in the same memory pool as stored keys. Consequently, accumulating a backlog of requests drives additional key evictions. This feedback loop self-perpetuates until enough clients are stalled that the incoming request rate falls back below the outgoing response rate.
Immediately after the eviction-heavy saturation phase ends, memory usage rapidly drops as Redis handles its backlog of client requests. This memory reclaim phase usually takes tens of seconds and reclaims a few hundred megabytes of memory.
Afterwards, because memory usage is below the `maxmemory` limit, Redis latency is back to normal. However, memory usage slowly climbs back up towards `maxmemory`, because the rate of adding keys exceeds the rate of expiring keys via TTL and reclaiming their memory. It takes several minutes to reach saturation again (varying with the time of day's write rate versus expiry rate). During that unsaturated timespan, performance is what we expect, but upon reaching saturation, the eviction burst cycle starts again.
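To make the shared-memory-pool point concrete, per-client input and output buffer sizes can be compared against overall memory usage. The sketch below is an illustration only, not tooling from this investigation; the host and port are placeholder assumptions, and it relies on the standard `qbuf`/`omem` fields of CLIENT LIST.

```python
# Sketch: sum per-client buffer memory (request backlog and pending replies)
# and compare it with used_memory, since both draw from the same memory pool.
# Host and port are placeholder assumptions.
import redis

r = redis.Redis(host="redis-cache.example.internal", port=6379)

clients = r.client_list()
# 'qbuf' = query (input) buffer bytes, 'omem' = output buffer bytes, per client.
backlog_bytes = sum(int(c.get("qbuf", 0)) for c in clients)
reply_bytes = sum(int(c.get("omem", 0)) for c in clients)

mem = r.info("memory")
print(f"clients={len(clients)}")
print(f"input buffers (request backlog): {backlog_bytes} bytes")
print(f"output buffers (pending replies): {reply_bytes} bytes")
print(f"used_memory={mem['used_memory']} maxmemory={mem['maxmemory']}")
```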
For more details, see:
- #1601 (comment 989997850) - Behavior model: Throughput reduction drives increased memory pressure and key evictions until enough clients stall
- #1601 (comment 982623949) - 3-phase cycle of bursty evictions
- #1601 (comment 982174108) - Summary of experiment results finding that rapid memory reclaim discretely follows the end of evictions. This discovery led to the 3-phase model noted above.
- #1601 (comment 982498636) - During eviction burst, many calls to `performEvictions` each free a little memory.
Remedy: How can we reduce or avoid the performance impact?
As described above, the performance regression is driven by a feedback loop where: memory saturation -> CPU saturation -> throughput falls below request arrival rate -> extra memory pressure.
Some options to break that cycle:
- Avoid memory saturation, so the eviction feedback loop does not start.
  - Reduce TTL and/or increase `maxmemory`, so that peak daily memory usage is below the `maxmemory` saturation point. (This is our top pick as a short-term solution; see the sketches below.)
  - Split the keyspace, moving some keys to a separate redis instance. (This is our preferred medium-term solution.)
  - Use a separate memory pool for key storage versus client buffers. (Introducing this isolation boundary would trade the current failure mode for new ones; not clear if this would be worthwhile. Idea is on hold for now.)
- Avoid CPU saturation during evictions, so request throughput does not drop.
  - Reduce CPU overhead for evictions. (We tested some of these options and concluded the benefit was inadequate.)
  - Reduce client request rate. (This seems impractical at a glance.)
  - Increase CPU capacity for the redis main thread. (Already done.)
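For the `maxmemory` half of that option, the limit and eviction policy can be inspected and adjusted at runtime via CONFIG. The sketch below is illustrative only; the host, port, and the 64 GB value are placeholder assumptions, not a decided setting.

```python
# Sketch: inspect and (optionally) raise the maxmemory limit at runtime.
# Host, port, and the example 64 GB value are placeholder assumptions.
import redis

r = redis.Redis(host="redis-cache.example.internal", port=6379)

print(r.config_get("maxmemory"))         # e.g. {'maxmemory': '64424509440'} (60 GB)
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'allkeys-lru'}

# Raising the limit only buys headroom if the host has spare RAM; once usage
# reaches the new limit, the same feedback loop starts again.
r.config_set("maxmemory", 64 * 1024**3)
```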
Avoiding evictions is the simplest solution. If memory usage stays below the `maxmemory` limit, this avoids starting the feedback loop described above. To do so, we plan to reduce TTL for some stored keys.
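And for the TTL half, a reduced TTL is applied where the cache entries are written. The sketch below is a minimal illustration of that write path; the key name, payload, and TTL values are hypothetical placeholders, not the actual keys or values we plan to change.

```python
# Sketch: write a cache entry with a reduced TTL so it expires (and frees its
# memory) before daily peak usage reaches the maxmemory saturation point.
# Key name, payload, and TTL values are hypothetical placeholders.
import redis

r = redis.Redis(host="redis-cache.example.internal", port=6379)

REDUCED_TTL_SECONDS = 15 * 60  # hypothetical shorter TTL (was e.g. a few hours)

# SETEX stores the value and its TTL atomically.
r.setex("cache:project:1234:readme", REDUCED_TTL_SECONDS, "<rendered readme html>")

# The TTL of an existing key can also be shortened in place.
r.expire("cache:project:1234:readme", REDUCED_TTL_SECONDS)
```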
For more details, see:
- #1601 (comment 1014582669) - Summary of next steps, as of 2022-07-04. This was the outcome of our planning meeting to discuss potentially viable solutions after having completed the pathology analysis.
- #1601 (comment 990891708) - This thread includes the discussion and initial results of the keyspace analysis, which aims to inform how we choose keys and TTL values to adjust.
- #1601 (comment 1015069256) - Keyspace analysis results, summing key count and idle time by key naming pattern.
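For context on what that keyspace analysis involves, the general approach can be sketched as a SCAN over the keyspace with OBJECT IDLETIME, aggregated by key prefix. The snippet below is a simplified illustration, not the actual analysis script; the host, port, and the prefix-splitting rule are assumptions.

```python
# Sketch: aggregate key count and idle time by key naming pattern using a
# non-blocking SCAN. Host, port, and the prefix rule (first two colon-separated
# segments) are assumptions for illustration.
from collections import defaultdict

import redis

r = redis.Redis(host="redis-cache.example.internal", port=6379)

counts = defaultdict(int)
idle_seconds = defaultdict(int)

for key in r.scan_iter(count=1000):
    prefix = b":".join(key.split(b":")[:2])
    counts[prefix] += 1
    # OBJECT IDLETIME = seconds since the key was last read or written
    # (only available with an LRU/noeviction maxmemory policy).
    idle_seconds[prefix] += r.object("idletime", key) or 0

for prefix, n in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{prefix.decode()!r}: keys={n} mean_idle_s={idle_seconds[prefix] / n:.0f}")
```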
Next Steps
- Project to add a new Redis shard for caching Repository information &762 (closed) (15.3% of redis-cache storage space at the time of writing)
- Modify redis caching strategies to reduce maxmemory events on redis-cache instances. gitlab-org&8419