2022-06-03: Try out eviction configuration tweaks on redis-cache

Production Change

Change Summary

Redis cache evictions look bursty (scalability#1601). Redis commands that run during an eviction burst are slightly delayed, which slows down the requests that issue them.

There are 2 settings we'd like to adjust in order to see their impact on production workloads:

Disable lazyfree-lazy-eviction
Reduce maxmemory-eviction-tenacity from 10 (the default, 500 us) to 5 (250 us) and then to 1 (50 us).

See scalability#1601 (comment 943602636) for the details.

Change Details

Services Impacted - ServiceRedis (Redis cache).
Change Technician - @reprazent
Change Reviewer - @msmiley
Time tracking - 1h (including gathering information)
Scheduled Start Time - 15:00 UTC 3rd June 2022
Downtime Component - No

Background notes for running this experiment

This experiment aims to test 2 hypotheses about why evictions occur in large bursts that reduce memory usage more than expected and indirectly cause latency spikes due to the multi-second overhead of this eviction effort.

Each time we experimentally adjust a setting, we must wait for evictions to begin again, so that we can observe the effects of the tuning. Currently evictions occur roughly once every 10 minutes, but the delay varies based on workload and the amount of memory freed by the previous eviction burst.

Evictions begin when redis-cache memory usage rises to 60 GB (its configured maxmemory limit), so after each tuning adjustment we must wait until that triggering condition is reached before we can observe the effects.

For the same reason, any ad hoc measurements should start shortly before memory usage reaches maxmemory limit, so we can observe the state transition when evictions begin.
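One way to judge how close the node is to the eviction threshold is to compare used_memory against maxmemory in the output of INFO memory. The following is a minimal sketch, not part of the original plan: the function name memory_pct_of_limit is hypothetical, and it assumes the standard used_memory and maxmemory fields are present in the INFO output.

```shell
# Hypothetical helper: reads `INFO memory` output on stdin and prints
# used_memory as an integer percentage of maxmemory.
memory_pct_of_limit() {
  awk -F: '
    $1 == "used_memory" { used = $2 + 0 }   # "+ 0" also ignores the trailing CR
    $1 == "maxmemory"   { max  = $2 + 0 }
    END {
      if (max <= 0) exit 1                   # maxmemory 0 means no limit is set
      printf "%d\n", used * 100 / max
    }'
}

# Example usage on the primary node (assumes gitlab-redis-cli is available):
#   sudo gitlab-redis-cli info memory | memory_pct_of_limit
```

Starting the ad hoc captures once this value gets close to 100 should leave enough lead time to see the transition into eviction activity.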

The following graphs show redis memory usage and eviction rate:

The changes in this experiment aim to make these graphs have much shallower and more frequent dips/spikes.

The following series of Redis commands -- CONFIG GET and CONFIG SET -- will be run via sudo gitlab-redis-cli on the redis-cache primary node.

Before and after making each change, show the current values for the redis settings we are experimentally adjusting:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
Tue 31 May 2022 10:55:30 PM UTC
lazyfree-lazy-eviction
yes
maxmemory-eviction-tenacity
10

At any point, we can either revert these changes or extend the observation window to build confidence in the results, but by the time we close this issue, we will revert these settings to their original state.

During these tuning adjustments we will probably run ad hoc observations, such as the ones detailed below in the "Additional observations" section.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - ~1h, including restoring the instance to the state we found it in

Set label change::in-progress: /label ~change::in-progress

SSH to the primary redis-cache node (currently redis-cache-01). Confirm this node is in the primary role:

$ ssh redis-cache-01-db-gprd.c.gitlab-production.internal

$ sudo gitlab-redis-cli role | head -n1
master
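The role check above could also be scripted as a guard, so that later steps stop if the node is not the primary. This is a sketch under the assumption that the first line of the ROLE output is the role name, as in the example above; the function name is_primary is hypothetical.

```shell
# Hypothetical guard: succeeds only if the first line of ROLE output is "master".
# Argument: the full text printed by `gitlab-redis-cli role`.
is_primary() {
  role=$(printf '%s\n' "$1" | head -n1 | tr -d '\r')
  [ "$role" = "master" ]
}

# Example usage (assumes gitlab-redis-cli is available on the node):
#   is_primary "$(sudo gitlab-redis-cli role)" || { echo "not the primary, stopping" >&2; exit 1; }
```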

We will make one change at a time, and leave it in place for long enough to observe at least one round of evictions.

Tune `lazyfree-lazy-eviction`

Disable lazy evictions:

Capture settings before applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Apply the change:

$ date ; sudo gitlab-redis-cli CONFIG SET lazyfree-lazy-eviction no

Capture settings after applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done
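Every adjustment in this plan follows the same capture, apply, capture pattern, so it could be wrapped in a small helper. The sketch below is illustrative only: the names REDIS_CLI, show_eviction_settings, and apply_setting are hypothetical, and the Redis commands it issues are the same ones listed in the steps.

```shell
# Command used to reach Redis; override it (e.g. REDIS_CLI=echo) for a dry run.
REDIS_CLI="${REDIS_CLI:-sudo gitlab-redis-cli}"

# Print a timestamp and the current values of both settings under test.
show_eviction_settings() {
  date
  for config in lazyfree-lazy-eviction maxmemory-eviction-tenacity; do
    $REDIS_CLI --raw config get "$config"
  done
}

# Hypothetical wrapper: capture settings, apply one CONFIG SET, capture again.
# Arguments: setting name, new value.
apply_setting() {
  show_eviction_settings
  date
  $REDIS_CLI CONFIG SET "$1" "$2"
  show_eviction_settings
}

# Example usage:
#   apply_setting lazyfree-lazy-eviction no
```

Running it first with REDIS_CLI=echo prints the commands without sending them to Redis, which is a cheap way to review them.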

Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).

Re-enable lazy evictions

Revert the above tuning adjustment after getting measurements from either:

at least one large eviction burst
at least several minutes of shorter, more frequent eviction activity

Steps:

Capture settings before applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Apply the change:

$ date ; sudo gitlab-redis-cli CONFIG SET lazyfree-lazy-eviction yes

Capture settings after applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Tune `maxmemory-eviction-tenacity`

Reduce tenacity from 10 (500 us) to 5 (250 us)

Capture settings before applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Apply the change:

$ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 5

Capture settings after applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).

Reduce tenacity from 5 (250 us) to 1 (50 us)

Capture settings before applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Apply the change:

$ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 1

Capture settings after applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Wait for evictions to begin. Watch the graphs for redis memory usage and eviction rate.
Start capturing ad hoc observations (perf, funccount, funclatency, etc.) shortly before evictions begin (i.e. before memory usage climbs back up to 60 GB).

Revert tenacity to 10 (500 us)

Revert the above tuning adjustment after getting measurements from either:

at least one large eviction burst
at least several minutes of shorter, more frequent eviction activity

Steps:

Capture settings before applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Apply the change:

$ date ; sudo gitlab-redis-cli CONFIG SET maxmemory-eviction-tenacity 10

Capture settings after applying change:

$ date ; for CONFIG in lazyfree-lazy-eviction maxmemory-eviction-tenacity ; do sudo gitlab-redis-cli --raw config get "$CONFIG" ; done

Conclude experiment

At the conclusion of this experiment, we should restore the original values of the 2 tunable settings. Based on the results, we can apply these changes persistently in a follow-up change issue.
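To avoid retyping the original values by hand, the restore commands could be generated from a snapshot taken before the experiment. This is a sketch under assumptions: the helper name snapshot_to_config_set and the file name settings.before.txt are hypothetical, and the snapshot is assumed to contain only the alternating name and value lines printed by `--raw config get` (no date line).

```shell
# Hypothetical helper: reads alternating name/value lines (the format printed by
# `gitlab-redis-cli --raw config get NAME`) on stdin and prints one
# "CONFIG SET name value" line per pair.
snapshot_to_config_set() {
  while IFS= read -r name && IFS= read -r value; do
    printf 'CONFIG SET %s %s\n' "$name" "$value"
  done
}

# Example usage:
#   before the experiment, save the config get output to settings.before.txt
#   at the end, print the restore commands for review:
#     snapshot_to_config_set < settings.before.txt
```

The helper only prints the commands, so they can be reviewed before being run through gitlab-redis-cli.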

Set label change::complete: /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 2m

CONFIG SET lazyfree-lazy-eviction yes
CONFIG SET maxmemory-eviction-tenacity 10

Monitoring

Key metrics to observe

Redis-cache overview: https://dashboards.gitlab.net/d/redis-cache-main/redis-cache-overview?orgId=1 (Specifically RPS).
Spiky redis durations for requests: https://log.gprd.gitlab.net/goto/c592a6c0-e0f3-11ec-aade-19e9974a7229
Redis memory usage on redis-cache-01
Redis eviction rate on redis-cache-01
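If the dashboards are unavailable, the eviction rate can be approximated on the node from the cumulative evicted_keys counter in INFO stats. The following is a rough sketch, not part of the original plan; both helper names are hypothetical, and the rate uses integer division.

```shell
# Hypothetical helper: extracts the cumulative evicted_keys counter from
# `INFO stats` output read on stdin.
evicted_keys() {
  tr -d '\r' | awk -F: '$1 == "evicted_keys" { print $2 }'
}

# Hypothetical helper: evictions per second between two counter samples.
# Arguments: previous count, current count, seconds between the samples.
eviction_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# Example usage (assumes gitlab-redis-cli is available on the node):
#   prev=$(sudo gitlab-redis-cli info stats | evicted_keys)
#   sleep 10
#   curr=$(sudo gitlab-redis-cli info stats | evicted_keys)
#   eviction_rate "$prev" "$curr" 10
```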

Additional observations

Additionally, we will run the following ad hoc observations around the time when evictions are expected to occur after each tuning adjustment. Depending on results, we may skip any of these, but they are all expected to be tolerably low overhead so that we can run them for several minutes if needed.

Capture perf profile for 10 minutes at 99 Hz for redis-server process

The perf-script output file can be reviewed in Flamescope to find the precise timespan of eviction activity.

We can observe CPU utilization of the main thread, delegation of work to background threads, frequency of samples in eviction-related code paths, etc.

Note: This script is identical to /usr/local/bin/perf_flamegraph_for_pid.sh, except that its DURATION_SECONDS is set to 600 instead of 60.

$ ~msmiley/perf_flamegraph_for_pid.10_minutes.sh $( pgrep -o redis-server )

Capture latency distribution for redis function `performEvictions`

When we reduce tenacity below 10, the top bucket should shift to "256 -> 511". For background and comparison results, see: scalability#1601 (comment 857907521)

$ date ; sudo funclatency-bpfcc --microseconds --timestamp --interval 1 --duration 600 --pid $( pgrep -o redis-server ) '/opt/gitlab/embedded/bin/redis-server:performEvictions' ; date

Count the number of slow `performEvictions` calls

Adjust the threshold (--min-us 500) based on the tenacity value (50 us * tenacity):

tenacity 10 = 500 us
tenacity 5 = 250 us
tenacity 1 = 50 us

For background and comparison results, see: scalability#1601 (comment 857934814)

$ date ; sudo timeout 300 funcslower-bpfcc --min-us 500 --time --user-stack --pid $( pgrep -o redis-server ) '/opt/gitlab/embedded/bin/redis-server:performEvictions' | tee funcslower.500us.performEvictions.txt
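The threshold for each run can be derived from the tenacity value instead of being edited by hand. The sketch below applies the 50 us * tenacity rule stated above and deliberately accepts only tenacity values 1 through 10, the range this experiment uses; the function name tenacity_to_us and the MIN_US variable are hypothetical.

```shell
# Hypothetical helper: eviction time limit in microseconds for a tenacity
# value, using the 50 us * tenacity rule (only values 1..10 are accepted).
tenacity_to_us() {
  case "$1" in
    [1-9]|10) echo $(( 50 * $1 )) ;;
    *) echo "unsupported tenacity: $1" >&2; return 1 ;;
  esac
}

# Example usage: derive the threshold and the output file name for tenacity 5.
#   MIN_US=$(tenacity_to_us 5)
#   sudo timeout 300 funcslower-bpfcc --min-us "$MIN_US" --time --user-stack \
#     --pid $( pgrep -o redis-server ) \
#     '/opt/gitlab/embedded/bin/redis-server:performEvictions' \
#     | tee "funcslower.${MIN_US}us.performEvictions.txt"
```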

Change Reviewer checklist

C4 C3 C2 C1:

Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity::1 or severity::2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Edited Jun 03, 2022 by Matt Smiley
