This may be due to fragmentation, among other things, but we should probably lower the maxmemory setting in Redis a little further to ensure that we don't run into the OOM killer while performing copy-on-write (CoW) RDB snapshots, as we have seen in the past.
Redis will not always free up (return) memory to the OS when keys are removed. This is not something special about Redis, but it is how most malloc() implementations work. For example if you fill an instance with 5GB worth of data, and then remove the equivalent of 2GB of data, the Resident Set Size (also known as the RSS, which is the number of memory pages consumed by the process) will probably still be around 5GB, even if Redis will claim that the user memory is around 3GB. This happens because the underlying allocator can't easily release the memory. For example often most of the removed keys were allocated in the same pages as the other keys that still exist.
The previous point means that you need to provision memory based on your peak memory usage. If your workload from time to time requires 10GB, even if most of the times 5GB could do, you need to provision for 10GB.
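For reference, the gap between logical usage and what the OS actually has resident can be read straight out of `INFO memory`; a minimal sketch (host and port are placeholders for the target instance):

```shell
# Compare logical usage vs. resident memory for the Redis process.
# Host/port below are placeholders; adjust for the instance being inspected.
redis-cli -h redis-cache-01-db-gprd -p 6379 INFO memory | \
  grep -E 'used_memory:|used_memory_rss:|mem_fragmentation_ratio:'
# mem_fragmentation_ratio is roughly used_memory_rss / used_memory; values well
# above 1.0 suggest the allocator is holding pages it cannot return to the OS.
```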
So it looks like prior to September 2018 we had maxmemory set to 80GB and then changed it to 60GB. That explains why the RSS got up close to 80GB, especially since keys that are still in use can sit on the same pages as freed ones, causing fragmentation.
Before making the production change, I went to gstg and tried the maxmemory change there. In gstg we also have maxmemory set to 60GB, although used_memory_rss was lower than used_memory (unlike production). Still, to verify that used_memory_rss goes down when maxmemory is reduced, I lowered maxmemory from 60GB to 40GB via `config set maxmemory 42949672960`. This worked and used_memory_rss went down as expected, after which I bumped maxmemory back to 60GB. (This was all in staging.)
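For the record, the staging test amounted to roughly the following (a sketch run on the gstg cache node; host/auth details are omitted, and 42949672960 / 64424509440 are 40GB / 60GB in bytes):

```shell
redis-cli CONFIG GET maxmemory                 # confirm the current 60GB setting
redis-cli CONFIG SET maxmemory 42949672960     # lower to 40GB to force memory to be freed
redis-cli INFO memory | grep -E 'used_memory:|used_memory_rss:'   # RSS should drop
redis-cli CONFIG SET maxmemory 64424509440     # restore to 60GB
```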
I think what we can try is to change maxmemory to a slightly lower value (say 59GB) on each redis-cache-0[1-3]-db-gprd host, confirm that the RSS goes down, and then bump it back to 60GB. This should keep the RSS from growing beyond used_memory UNLESS fragmentation occurs again. (If fragmentation does occur, we would still run into the issue regardless of how low we set maxmemory.)
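Roughly, the per-host procedure would look like this (a sketch only; the host list is taken from the pattern above, ports/auth are assumptions, and 63350767616 is 59GB in bytes):

```shell
# Temporarily lower maxmemory to 59GB on each gprd cache node, confirm the RSS
# drops, then restore 60GB. In practice, wait and verify the RSS drop (and
# overall health) on each host before restoring and moving on to the next one.
for host in redis-cache-01-db-gprd redis-cache-02-db-gprd redis-cache-03-db-gprd; do
  redis-cli -h "$host" CONFIG SET maxmemory 63350767616   # 59GB
  redis-cli -h "$host" INFO memory | grep 'used_memory_rss:'
  redis-cli -h "$host" CONFIG SET maxmemory 64424509440   # back to 60GB
done
```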
I executed the production CR: production#1337 (closed). There were a few behavioral differences that I noticed during the CR that I want to point out below and investigate further.
Time To Complete Command
In gstg, when I changed the maxmemory setting, the CONFIG SET command executed immediately. However, in gprd it took a good minute for the command to complete and respond with an OK message.
Time To Take Effect
In gstg, when I changed the maxmemory setting, used_memory_rss dropped immediately. However, in gprd it didn't change used_memory_rss at all. After a good 10-15 minutes, I changed maxmemory again (a 2nd reduction), from 59GB to 57GB. After a few minutes, used_memory_rss went all the way down to 30+GB. So at this point it is not clear whether the first change simply took a long time to take effect OR whether the 2nd change is what triggered the RSS drop.
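To disambiguate this next time, a timestamped poll of used_memory_rss around the CONFIG SET commands would let us correlate the exact moment of the drop; a rough sketch, with the host and 30s interval as assumptions:

```shell
# Print a timestamped used_memory_rss sample every 30s so the moment of the RSS
# drop can be lined up against when each CONFIG SET was issued.
while true; do
  rss=$(redis-cli -h redis-cache-01-db-gprd INFO memory | \
        grep '^used_memory_rss:' | cut -d: -f2 | tr -d '\r')
  echo "$(date -u +%FT%TZ) used_memory_rss=${rss}"
  sleep 30
done
```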
Failover
Originally, redis-cache-01 was the primary. After the 2nd change mentioned above there was a failover and redis-cache-02 became the primary. During the high load of the key-eviction process, Sentinel must have concluded that the primary was down and triggered a failover. So a maxmemory change CAN cause a failover (even if the delta is only 1GB or 2GB).
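When that happens, Sentinel itself is the quickest place to confirm which node it currently considers the primary; a sketch, assuming the default Sentinel port 26379 and a hypothetical master name of gitlab-redis-cache (the real name is whatever is configured in sentinel.conf):

```shell
# Ask Sentinel which node it currently treats as the primary.
# Port 26379 is the Sentinel default; the master name here is an assumption.
redis-cli -h redis-cache-01-db-gprd -p 26379 \
  SENTINEL get-master-addr-by-name gitlab-redis-cache
```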
Before 5AM UTC on 11/13, we see that redis-cache-02 was the only secondary shown. However, after the failover the secondary shown switched from redis-cache-02 to redis-cache-03. Why does only one secondary show up on this graph?
This can also be seen in a graph like Expired Keys, which likewise shows only the one secondary host expiring keys.
Shouldn't both secondary hosts be here?
Sentinel still shows 2 slaves, as below. However, I am not able to run any redis-cli operations on redis-cache-03 right now.
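For completeness, the replica view Sentinel reports can be pulled with the command below (again assuming port 26379 and the hypothetical master name from above); the flags and link-status fields in the output should show whether Sentinel thinks redis-cache-03 is reachable even though direct redis-cli access to it is failing:

```shell
# List the replicas Sentinel knows about, including their health flags
# (look for s_down / disconnected entries for redis-cache-03).
redis-cli -h redis-cache-01-db-gprd -p 26379 SENTINEL slaves gitlab-redis-cache
```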
@andrewn - When you get a chance, could I please get your thoughts on the above Redis observations from the maxmemory adjustment CR we did a couple of weeks ago? At this point, I don't have high confidence that doing something similar again won't produce the same behavior, which caused a little turbulence in prod.
It looks like memory utilization has definitely gone down since the 11/13 CR. However, it is still hovering a little above what we set maxmemory to. But given that we have a Hard SLO threshold, we should know if we start to breach it again, and if we do, we can revisit this.