2022-06-03: Try out eviction configuration tweaks on redis-cache
Production Change
Change Summary
Redis cache evictions look bursty (scalability#1601). Redis commands that run during one of these eviction bursts are slightly delayed, which slows down the requests that issue them.
There are 2 settings we'd like to adjust in order to see their impact on production workloads:
- Disable lazyfree-lazy-eviction
- Reduce maxmemory-eviction-tenacity from 10 (the default, 500 us) to 5 (250 us) and then to 1 (50 us).
See scalability#1601 (comment 943602636) for the details.
Change Details
- Services Impacted - ServiceRedis (Redis cache).
- Change Technician - @reprazent
- Change Reviewer - @msmiley
- Time tracking - 1h (including gathering information)
- Scheduled Start Time - 15:00 UTC 3rd June 2022
- Downtime Component - No
Background notes for running this experiment
This experiment aims to test 2 hypotheses about why evictions occur in large bursts that reduce memory usage more than expected and indirectly cause latency spikes due to the multi-second overhead of this eviction effort.
Each time we experimentally adjust a setting, we must wait for evictions to begin again, so that we can observe the effects of the tuning. Currently evictions occur roughly once every 10 minutes, but the delay varies based on workload and the amount of memory freed by the previous eviction burst.
Evictions begin when redis-cache memory usage rises to 60 GB (its configured maxmemory limit), so we must wait until that triggering condition is reached before we can observe the effects of a tuning adjustment.
For the same reason, any ad hoc measurements should start shortly before memory usage reaches the maxmemory limit, so we can observe the state transition when evictions begin.
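One way to judge how close the node is to that trigger is to compare used_memory with maxmemory from INFO memory. The following is only a sketch and not part of the change plan: the function name memory_pct_of_limit is hypothetical, and it assumes the INFO text is piped in on stdin.

```shell
# Hypothetical helper: reads `INFO memory` text on stdin and prints
# used_memory as a whole-number percentage of maxmemory.
memory_pct_of_limit() {
  # INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else exit 1   # maxmemory is 0 or missing: no limit to compare against
    }'
}

# Example on the primary (same CLI wrapper as used elsewhere in this issue):
#   sudo gitlab-redis-cli info memory | memory_pct_of_limit
```

Starting the ad hoc measurements once this value is in the high 90s should leave enough margin to catch the transition, although the right margin depends on how fast memory is growing at the time.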
The following graphs show redis memory usage and eviction rate:
The changes in this experiment aim to make these graphs have much shallower and more frequent dips/spikes.
The following series of Redis commands -- CONFIG GET and CONFIG SET -- will be run via sudo gitlab-redis-cli on the redis-cache primary node.
Before and after making each change, show the current values for the redis settings we are experimentally adjusting:
$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
Tue 31 May 2022 10:55:30 PM UTC
lazyfree-lazy-eviction
yes
maxmemory-eviction-tenacity
10
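Because this capture loop is repeated before and after every change, it may be convenient to wrap it in a function. This is a sketch under assumptions: capture_settings and the REDIS_CLI override are hypothetical names introduced here, not part of the existing tooling.

```shell
# Assumed override so the function can be exercised off-host;
# defaults to the CLI wrapper used throughout this issue.
REDIS_CLI="${REDIS_CLI:-sudo gitlab-redis-cli}"

# Hypothetical wrapper: print a timestamp, then the name and value
# of each of the two settings under test.
capture_settings() {
  date
  local config
  for config in lazyfree-lazy-eviction maxmemory-eviction-tenacity; do
    # Unquoted on purpose: REDIS_CLI may hold a command plus arguments.
    $REDIS_CLI --raw config get "$config"
  done
}
```

Piping the output through `tee -a` to a log file would keep a timestamped record of every capture for the issue comments.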
At any point, we can either revert these changes or extend the observation window to build confidence in the results, but by the time we close this issue, we will revert these settings to their original state.
During these tuning adjustments we will probably run ad hoc observations, such as the ones detailed below in the "Additional observations" section.
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - ~60, including leaving the instance as we found it
- Set label change::in-progress: /label ~change::in-progress
- SSH to the primary redis-cache node (currently redis-cache-01). Confirm this node is in the primary role:
  $ ssh redis-cache-01-db-gprd.c.gitlab-production.internal
  $ sudo gitlab-redis-cli role | head -n1
  master
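The role check can also be turned into a guard that stops a session from continuing on the wrong node. A minimal sketch, assuming a hypothetical require_primary function and REDIS_CLI override that do not exist in the current tooling:

```shell
# Assumed override for testing; defaults to the CLI wrapper used in this issue.
REDIS_CLI="${REDIS_CLI:-sudo gitlab-redis-cli}"

# Hypothetical guard: succeeds only when the first line of ROLE is "master".
require_primary() {
  local role
  role=$($REDIS_CLI role | head -n1 | tr -d '\r')
  if [ "$role" != "master" ]; then
    echo "refusing to continue: role is '${role}', expected 'master'" >&2
    return 1
  fi
}
```

Running `require_primary && ...` in front of each CONFIG SET would make a failover during the experiment visible right away, instead of silently tuning a replica.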
We will make one change at a time, and leave it in place for long enough to observe at least one round of evictions.
Tune lazyfree-lazy-eviction
Disable lazy evictions:
- Capture settings before applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Apply the change:
  $ date ; sudo gitlab-redis-cli CONFIG SET lazyfree-lazy-eviction no
- Capture settings after applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
- Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).
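The apply and capture steps above could be combined into one helper that sets a value and then confirms the server reports it. This is a sketch only: set_and_verify and the REDIS_CLI override are hypothetical names, and the manual steps above remain the plan of record.

```shell
# Assumed override for testing; defaults to the CLI wrapper used in this issue.
REDIS_CLI="${REDIS_CLI:-sudo gitlab-redis-cli}"

# Hypothetical helper: set one config value, read it back, and fail
# if the server does not report the value that was just set.
set_and_verify() {
  local name="$1" want="$2" got
  date
  $REDIS_CLI config set "$name" "$want" >/dev/null || return 1
  # `--raw config get` prints the name and then the value; keep the value.
  got=$($REDIS_CLI --raw config get "$name" | tail -n1 | tr -d '\r')
  if [ "$got" != "$want" ]; then
    echo "$name is '$got', expected '$want'" >&2
    return 1
  fi
  echo "$name = $got"
}

# Example: set_and_verify lazyfree-lazy-eviction no
```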
Re-enable lazy evictions
Revert the above tuning adjustment after getting measurements from either:
- at least one large eviction burst
- at least several minutes of shorter, more frequent eviction activity
Steps:
- Capture settings before applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Apply the change:
  $ date ; sudo gitlab-redis-cli CONFIG SET lazyfree-lazy-eviction yes
- Capture settings after applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
Tune maxmemory-eviction-tenacity
Reduce tenacity from 10 (500 us) to 5 (250 us)
- Capture settings before applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Apply the change:
  $ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 5
- Capture settings after applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
- Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).
Reduce tenacity from 5 (250 us) to 1 (50 us)
- Capture settings before applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Apply the change:
  $ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 1
- Capture settings after applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
- Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).
Revert tenacity to 10 (500 us)
Revert the above tuning adjustment after getting measurements from either:
- at least one large eviction burst
- at least several minutes of shorter, more frequent eviction activity
Steps:
- Capture settings before applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
- Apply the change:
  $ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 10
- Capture settings after applying change:
  $ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
Conclude experiment
At the conclusion of this experiment, we should restore the original values of the 2 tunable settings. Based on the results, we can apply these changes persistently as a follow-up change issue.
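To make restoring the original values less dependent on memory, the two settings could be saved to a file before the first change and replayed at the end. The sketch below is an illustration under assumptions: save_settings, restore_settings, the SETTINGS list, and the REDIS_CLI override are all hypothetical names introduced here.

```shell
# Assumed override for testing; defaults to the CLI wrapper used in this issue.
REDIS_CLI="${REDIS_CLI:-sudo gitlab-redis-cli}"
SETTINGS="lazyfree-lazy-eviction maxmemory-eviction-tenacity"

# Write one "name value" line per tunable to the given file.
save_settings() {
  local file="$1" name value
  : > "$file"
  for name in $SETTINGS; do
    value=$($REDIS_CLI --raw config get "$name" | tail -n1 | tr -d '\r')
    printf '%s %s\n' "$name" "$value" >> "$file"
  done
}

# Re-apply every "name value" line recorded by save_settings.
restore_settings() {
  local file="$1" name value
  while read -r name value; do
    # Detach stdin so the CLI cannot consume the rest of the file.
    $REDIS_CLI config set "$name" "$value" </dev/null
  done < "$file"
}
```

A typical session would call `save_settings original.txt` once before the first CONFIG SET and `restore_settings original.txt` before closing the issue.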
- Set label change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 2m
- CONFIG SET lazyfree-lazy-eviction yes
- CONFIG SET maxmemory-eviction-tenacity 10
Monitoring
Key metrics to observe
- Redis-cache overview: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1 (Specifically RPS).
- Spiky redis durations for requests: https://log.gprd.gitlab.net/goto/c592a6c0-e0f3-11ec-aade-19e9974a7229
- Redis memory usage on redis-cache-01
- Redis eviction rate on redis-cache-01
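If the dashboards are slow to update, the eviction rate can also be estimated by hand from the cumulative evicted_keys counter in INFO stats. This is a rough sketch: the function names evicted_keys and eviction_rate are hypothetical, and the result is an average over the sampling window rather than the instantaneous rate shown on the graphs.

```shell
# Hypothetical helper: extract the cumulative evicted_keys counter
# from `INFO stats` text on stdin.
evicted_keys() {
  tr -d '\r' | awk -F: '$1 == "evicted_keys" { print $2 }'
}

# Hypothetical helper: average keys evicted per second between two
# counter samples. Arguments: first sample, second sample, seconds apart.
eviction_rate() {
  local before="$1" after="$2" seconds="$3"
  echo $(( (after - before) / seconds ))
}

# Example on the primary:
#   a=$(sudo gitlab-redis-cli info stats | evicted_keys); sleep 10
#   b=$(sudo gitlab-redis-cli info stats | evicted_keys)
#   eviction_rate "$a" "$b" 10
```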
Additional observations
Additionally, we will run the following ad hoc observations around the time when evictions are expected to occur after each tuning adjustment. Depending on results, we may skip any of these, but they are all expected to be tolerably low overhead so that we can run them for several minutes if needed.
Capture perf profile for 10 minutes at 99 Hz for redis-server process
The perf-script output file can be reviewed in Flamescope to find the precise timespan of eviction activity.
We can observe CPU utilization of the main thread, delegation of work to background threads, frequency of samples in eviction-related code paths, etc.
Note: This script is identical to /usr/local/bin/perf_flamegraph_for_pid.sh, except that its DURATION_SECONDS is set to 600 instead of 60.
$ ~msmiley/perf_flamegraph_for_pid.10_minutes.sh $( pgrep -o redis-server )
Capture latency distribution for redis function performEvictions
When we reduce tenacity below 10, the top bucket should shift to "256 -> 511". For background and comparison results, see: scalability#1601 (comment 857907521)
$ date ; sudo funclatency-bpfcc --microseconds --timestamp --interval 1 --duration 600 --pid $( pgrep -o redis-server ) '/opt/gitlab/embedded/bin/redis-server:performEvictions' ; date
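funclatency reports durations in power-of-two buckets such as "256 -> 511". When reading the histogram, it can help to know which bucket a given duration falls into. The helper below is a hypothetical illustration (log2_bucket is not part of bcc), and it only maps a number to its bucket; it says nothing about how long performEvictions actually runs at a given tenacity.

```shell
# Hypothetical helper: print the power-of-two histogram bucket
# ("low -> high") that contains the given non-negative integer.
log2_bucket() {
  local value="$1" low=1
  if [ "$value" -lt 2 ]; then
    echo "0 -> 1"
    return
  fi
  # Find the largest power of two that does not exceed the value.
  while [ $(( low * 2 )) -le "$value" ]; do
    low=$(( low * 2 ))
  done
  echo "$low -> $(( low * 2 - 1 ))"
}

# Example: a call that takes 300 us is counted in the "256 -> 511" bucket.
#   log2_bucket 300
```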
Count the number of slow performEvictions calls
Adjust the threshold (--min-us 500) based on the tenacity value (50 us * tenacity):
- tenacity 10 = 500 us
- tenacity 5 = 250 us
- tenacity 1 = 50 us
For background and comparison results, see: scalability#1601 (comment 857934814)
$ date ; sudo timeout 300 funcslower-bpfcc --min-us 500 --time --user-stack --pid $( pgrep -o redis-server ) '/opt/gitlab/embedded/bin/redis-server:performEvictions' | tee funcslower.500us.performEvictions.txt
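The threshold can be derived from the tenacity value instead of being edited by hand. A minimal sketch, assuming a hypothetical tenacity_to_us helper; it covers only the 0 to 10 range used in this experiment, because the 50 us * tenacity rule stated above is assumed not to apply to higher values.

```shell
# Hypothetical helper: time limit in microseconds for a tenacity value,
# using the 50 us * tenacity rule from this issue. Values outside 0-10
# are rejected because the linear rule is only assumed to hold there.
tenacity_to_us() {
  local tenacity="$1"
  if [ "$tenacity" -lt 0 ] || [ "$tenacity" -gt 10 ]; then
    echo "tenacity must be between 0 and 10" >&2
    return 1
  fi
  echo $(( tenacity * 50 ))
}

# Example: pass the result as the funcslower threshold.
#   --min-us "$(tenacity_to_us 5)"
```

Using the same value in the output file name (for example funcslower.250us.performEvictions.txt) keeps the captured files distinguishable across the three tenacity settings.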
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity1 or severity2
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

