Redis-cache is the Redis deployment we use as the backing store for Rails.cache. We are currently using Redis 6.0. It is configured as an LRU cache using the maxmemory directive. As we have known for quite some time now (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420, #862 (closed), #1556 (closed)), the way maxmemory is implemented in Redis 6.0 and earlier leads to latency spikes. This is because the implementation favours throughput (evict old keys as fast as possible) over latency. Thankfully, in Redis 6.2 there is a change in the implementation of maxmemory so that it now favours latency over throughput.
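For context, the eviction-related settings look roughly like this (a sketch for illustration; the actual limits and policy for redis-cache are chef-managed and may differ):

```shell
# Inspect the eviction-related settings on a cache node (illustrative only;
# the real values come from our chef roles).
redis-cli CONFIG GET maxmemory          # memory cap in bytes; hitting it triggers evictions
redis-cli CONFIG GET maxmemory-policy   # e.g. allkeys-lru for a pure LRU cache
```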
Because of this we should upgrade redis-cache to Redis 6.2. We probably want to upgrade all our Redis instances to the same version, but only redis-cache suffers from these maxmemory latency spikes so we should upgrade it first. None of the other instances uses maxmemory.
The upgrade to Redis 6.0 was carried out by Reliability and tracked in &395. We can probably use that epic as a starting point.
Redis 6.2 doesn't work with our Redis configuration where hosts are specified by hostname, so we had to revert to using IPs. See production#6360 (comment 849952257) for more details.
We should leave 6.2 running on gstg for at least a couple of days. I'll prepare the change for gprd, incorporating the lessons we learned from the staging attempt.
We completed the upgrade procedure on gprd in production#6397.
@jarv I can confirm the symlinks are now on the production nodes as well:
```
alejandro@redis-cache-03-db-gprd.c.gitlab-production.internal:~$ ls -l /opt/gitlab/embedded/service/gitlab-rails/ee/lib/ee/gitlab/ci/parsers/security/validators/schemas/14*
lrwxrwxrwx 1 root root 83 Feb 4 09:49 /opt/gitlab/embedded/service/gitlab-rails/ee/lib/ee/gitlab/ci/parsers/security/validators/schemas/14.0.4 -> ../../../../../../../../../lib/gitlab/ci/parsers/security/validators/schemas/14.0.4
lrwxrwxrwx 1 root root 83 Feb 4 09:49 /opt/gitlab/embedded/service/gitlab-rails/ee/lib/ee/gitlab/ci/parsers/security/validators/schemas/14.0.5 -> ../../../../../../../../../lib/gitlab/ci/parsers/security/validators/schemas/14.0.5
lrwxrwxrwx 1 root root 83 Feb 4 09:49 /opt/gitlab/embedded/service/gitlab-rails/ee/lib/ee/gitlab/ci/parsers/security/validators/schemas/14.0.6 -> ../../../../../../../../../lib/gitlab/ci/parsers/security/validators/schemas/14.0.6
lrwxrwxrwx 1 root root 83 Feb 4 09:49 /opt/gitlab/embedded/service/gitlab-rails/ee/lib/ee/gitlab/ci/parsers/security/validators/schemas/14.1.0 -> ../../../../../../../../../lib/gitlab/ci/parsers/security/validators/schemas/14.1.0
```
Do we want to try @tkuah's fix? I'm assuming that since the deployer doesn't touch these nodes it shouldn't be a problem, and we can wait for an omnibus package fix, if one comes.
> I'm assuming that since the deployer doesn't touch these nodes it shouldn't be a problem, and we can wait for an omnibus package fix, if one comes.
Yeah, I think so. @tkuah, your fix is to prevent this from happening, I think, but will this be handled automatically in a later omnibus upgrade? If not, it looks like we should probably do a manual cleanup.
Looks like the behavior of the `save` config setting changes in 6.2: our current unspecified setting now translates to using the default save values, which we don't want for redis-cache. We'll set it to an empty string to prevent this in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1402.
Now that we've removed the unwanted `save` directives from redis-server's runtime config (as well as from redis.conf and gitlab.rb), the RDB backups are no longer running nearly continuously, and the apdex has recovered.
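For reference, the shape of the fix was roughly the following (a sketch, assuming redis-cli access on the node; the persistent setting itself is chef-managed via gitlab.rb / redis.conf):

```shell
# Check which RDB save points are in effect, then clear them at runtime.
redis-cli CONFIG GET save      # non-empty output means periodic RDB snapshots are enabled
redis-cli CONFIG SET save ""   # disable RDB snapshotting without a restart
# Persistent equivalent in redis.conf (rendered from gitlab.rb by chef): save ""
```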
Oh, and from our pairing session, @igorwwwwwwwwwwwwwwwwwwww noticed that we need to remember to make the same fix for the ratelimiting redis shard (in the chef roles for gstg, pre, and gprd):
I've been thinking about how we only noticed the change in behavior because an engineer was watching the result of this upgrade for a different project. It seems that the change in apdex pattern wasn't enough to trigger an alert.
Should we consider holding change issues for upgrades open for 24 hours so that any change in behavior can be observed? Or is there a different metric we should have been observing on that change issue at the time?
Noting what we discussed at the team demo: we believe our alerting would have caught the Apdex degradation had Jacob not detected the issue first. In short, if we do find that a CR had a negative impact that didn't trigger alerting, the action should be to tweak the alerting thresholds.
Following up on @jacobvosmaer-gitlab's finding that we unexpectedly still have large spikes in redis-cache's key eviction count, even after the upgrade from Redis 6.0 to 6.2.
To better understand what is allowing the burst to spike so high rather than remain smoothed, we want to capture a CPU profile of redis-server while it is experiencing an eviction burst.
**Eviction rate spike during the profile**
Here we capture a 10-minute profile, with an eviction spike near the middle. This lets us see the quiet periods before and after, plus the initially modest eviction rate that we think represents Redis 6.2's newly throttled eviction behavior.
This profile captured 497 stack traces per second from just the redis-server process.
Note: If RDB backups were enabled, this would also have captured the forked child process, but since that is disabled, this profile should only include the threads of the main redis-server process.
```
PRODUCTION PRIMARY-REDIS msmiley@redis-cache-01-db-gprd.c.gitlab-production.internal:~$ date ; sudo perf record -g --pid $( pgrep -o redis-server ) --freq 497 -- sleep 600
Thu 24 Feb 2022 04:58:11 PM UTC
[ perf record: Woken up 121 times to write data ]
[ perf record: Captured and wrote 33.928 MB perf.data (222606 samples) ]
PRODUCTION PRIMARY-REDIS msmiley@redis-cache-01-db-gprd.c.gitlab-production.internal:~$ stat perf.data | grep Modify
Modify: 2022-02-24 17:08:11.658303778 +0000
PRODUCTION PRIMARY-REDIS msmiley@redis-cache-01-db-gprd.c.gitlab-production.internal:~$ sudo perf script --header | gzip > $( hostname -s ).eviction_burst.10m_profile_ending_at_20220224_1708.perf-script.txt.gz
```
The main thing we want to see is the timeline for the main redis thread. So we need to first filter the raw results to include just that thread, and then load that extract into Flamescope.
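Roughly, that filtering step can be done with paragraph-mode awk over the perf-script output, keeping only the samples whose first line carries the main thread's TID. This is a sketch: the TID `12345` below is a placeholder (the real one can be found with `ps -L -p $(pgrep -o redis-server)`).

```shell
# perf script emits one blank-line-separated block per sample, with the comm and
# TID on the block's first line, so we can filter whole blocks by thread id.
zcat redis-cache-01-db-gprd.eviction_burst.10m_profile_ending_at_20220224_1708.perf-script.txt.gz \
  | awk -v RS= -v ORS='\n\n' '$1 == "redis-server" && $2 == "12345"' \
  > redis-cache-01-db-gprd.main_thread.perf-script.txt
```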
I need to switch focus for a little bit, so I'll return to this later today and share more results.
As a teaser, though, here is the complete unfiltered flamegraph for the whole 10-minute timespan. (Later we will get to see flamegraphs for just during the eviction burst event, which will probably be easier to interpret, so stay tuned!)
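(For anyone reproducing this: a flamegraph like that can be rendered from the perf-script output with Brendan Gregg's FlameGraph scripts, along the lines of the sketch below; this may not be the exact tooling used for the image above.)

```shell
# Fold the perf samples into stack counts, then render an SVG flamegraph.
# Assumes https://github.com/brendangregg/FlameGraph is cloned into ./FlameGraph.
zcat redis-cache-01-db-gprd.eviction_burst.10m_profile_ending_at_20220224_1708.perf-script.txt.gz \
  | ./FlameGraph/stackcollapse-perf.pl \
  | ./FlameGraph/flamegraph.pl --title "redis-server, full 10-minute capture" \
  > redis-cache-01-db-gprd.eviction_burst.10m.svg
```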
Ok, just a super quick look before my next meeting:
I did not yet filter to just the main redis thread, but we can probably still spot the CPU burst associated with the evictions. I'll double-check this later. Consider this a preview.