This may be due to fragmentation, among other things, but we should probably lower the maxmemory setting in Redis a little further to ensure that we don't run into the OOM killer while performing copy-on-write (CoW) RDB snapshots, as we have seen in the past.
Redis will not always free up (return) memory to the OS when keys are removed. This is not something special about Redis, but it is how most malloc() implementations work. For example if you fill an instance with 5GB worth of data, and then remove the equivalent of 2GB of data, the Resident Set Size (also known as the RSS, which is the number of memory pages consumed by the process) will probably still be around 5GB, even if Redis will claim that the user memory is around 3GB. This happens because the underlying allocator can't easily release the memory. For example often most of the removed keys were allocated in the same pages as the other keys that still exist.
The previous point means that you need to provision memory based on your peak memory usage. If your workload from time to time requires 10GB, even if most of the times 5GB could do, you need to provision for 10GB.
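For reference, the gap between logical usage and what the OS actually has resident can be read straight out of `INFO memory`; a minimal sketch (host and port are placeholders for the target instance):

```shell
# Compare logical usage vs. resident memory for the Redis process.
# Host/port below are placeholders; adjust for the instance being inspected.
redis-cli -h redis-cache-01-db-gprd -p 6379 INFO memory | \
  grep -E 'used_memory:|used_memory_rss:|mem_fragmentation_ratio:'
# mem_fragmentation_ratio is roughly used_memory_rss / used_memory; values well
# above 1.0 suggest the allocator is holding pages it cannot return to the OS.
```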
So it looks like prior to September 2018 we had maxmemory set to 80GB and then changed it to 60GB. That explains why the RSS got up close to 80GB, especially since keys that are still in use can sit on the same pages as freed ones, causing fragmentation.
Before making the production change, I went to gstg and tried the maxmemory change there. In gstg we also have maxmemory set to 60GB, although used_memory_rss was lower than used_memory (unlike production). Still, to verify that used_memory_rss goes down when maxmemory is reduced, I lowered maxmemory from 60GB to 40GB via `config set maxmemory 42949672960`. This worked and used_memory_rss went down as expected, after which I bumped maxmemory back to 60GB. (This was all in staging.)
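For the record, the staging test amounted to roughly the following (a sketch run on the gstg cache node; host/auth details are omitted, and 42949672960 / 64424509440 are 40GB / 60GB in bytes):

```shell
redis-cli CONFIG GET maxmemory                 # confirm the current 60GB setting
redis-cli CONFIG SET maxmemory 42949672960     # lower to 40GB to force memory to be freed
redis-cli INFO memory | grep -E 'used_memory:|used_memory_rss:'   # RSS should drop
redis-cli CONFIG SET maxmemory 64424509440     # restore to 60GB
```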
I think what we can try is to change maxmemory to a slightly lower value (say 59GB) on each redis-cache-0[1-3]-db-gprd host, confirm that the RSS goes down, and then bump it back to 60GB. This should keep the RSS from growing beyond used_memory UNLESS fragmentation occurs again. (If fragmentation does occur, we would still run into the issue regardless of how low we set maxmemory.)
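Roughly, the per-host procedure would look like this (a sketch only; the host list is taken from the pattern above, ports/auth are assumptions, and 63350767616 is 59GB in bytes):

```shell
# Temporarily lower maxmemory to 59GB on each gprd cache node, confirm the RSS
# drops, then restore 60GB. In practice, wait and verify the RSS drop (and
# overall health) on each host before restoring and moving on to the next one.
for host in redis-cache-01-db-gprd redis-cache-02-db-gprd redis-cache-03-db-gprd; do
  redis-cli -h "$host" CONFIG SET maxmemory 63350767616   # 59GB
  redis-cli -h "$host" INFO memory | grep 'used_memory_rss:'
  redis-cli -h "$host" CONFIG SET maxmemory 64424509440   # back to 60GB
done
```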
I executed the production CR: production#1337 (closed). There were a few behavioral differences that I noticed during the CR that I want to point out below and investigate further.
Time To Complete Command
In gstg, when I changed the maxmemory setting, the CONFIG SET command executed immediately. However, in gprd it took a good minute for the command to complete and respond with an OK message.
Time To Take Effect
In gstg, when I changed the maxmemory setting, used_memory_rss dropped immediately. However, in gprd it didn't change used_memory_rss at all. After a good 10-15 minutes, I changed maxmemory again (a 2nd reduction), from 59GB to 57GB. After a few minutes, used_memory_rss went all the way down to 30+GB. So at this point it is not clear whether the first change simply took a long time to take effect OR whether the 2nd change is what triggered the RSS drop.
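To disambiguate this next time, a timestamped poll of used_memory_rss around the CONFIG SET commands would let us correlate the exact moment of the drop; a rough sketch, with the host and 30s interval as assumptions:

```shell
# Print a timestamped used_memory_rss sample every 30s so the moment of the RSS
# drop can be lined up against when each CONFIG SET was issued.
while true; do
  rss=$(redis-cli -h redis-cache-01-db-gprd INFO memory | \
        grep '^used_memory_rss:' | cut -d: -f2 | tr -d '\r')
  echo "$(date -u +%FT%TZ) used_memory_rss=${rss}"
  sleep 30
done
```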
Failover
Originally, redis-cache-01 was the primary. After the 2nd change mentioned above there was a failover and redis-cache-02 became the primary. During the high load of the key-eviction process, Sentinel must have concluded that the primary was down and triggered a failover. So a maxmemory change CAN cause a failover (even if the delta is only 1GB or 2GB).
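When that happens, Sentinel itself is the quickest place to confirm which node it currently considers the primary; a sketch, assuming the default Sentinel port 26379 and a hypothetical master name of gitlab-redis-cache (the real name is whatever is configured in sentinel.conf):

```shell
# Ask Sentinel which node it currently treats as the primary.
# Port 26379 is the Sentinel default; the master name here is an assumption.
redis-cli -h redis-cache-01-db-gprd -p 26379 \
  SENTINEL get-master-addr-by-name gitlab-redis-cache
```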
Before 5AM UTC on 11/13, we see that redis-cache-02 was the only secondary shown. However, after the failover the secondary shown switched from redis-cache-02 to redis-cache-03. Why does only one secondary show up on this graph?
This can also be seen in a graph like Expired Keys, which likewise shows only the one secondary host expiring keys.
Shouldn't both secondary hosts be here?
Sentinel still shows 2 slaves, as below. However, I am not able to run any redis-cli operations on redis-cache-03 right now.
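For completeness, the replica view Sentinel reports can be pulled with the command below (again assuming port 26379 and the hypothetical master name from above); the flags and link-status fields in the output should show whether Sentinel thinks redis-cache-03 is reachable even though direct redis-cli access to it is failing:

```shell
# List the replicas Sentinel knows about, including their health flags
# (look for s_down / disconnected entries for redis-cache-03).
redis-cli -h redis-cache-01-db-gprd -p 26379 SENTINEL slaves gitlab-redis-cache
```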
@andrewn - When you get a chance, could I please get your thoughts on the above Redis observations from the maxmemory adjustment CR we did a couple of weeks ago? At this point, I don't have high confidence that doing something similar again won't produce the same behavior, which caused a little turbulence in prod.
It looks like memory utilization has definitely gone down since the 11/13 CR. However, it is still hovering a little above what we set maxmemory to. But given that we have a Hard SLO threshold, we should know if we start to breach it again, and if we do, we can revisit this.