Avoid running Redis RDB backups on primary nodes
Goal
Only run periodic Redis RDB backups on replica nodes, not the current primary node. This avoids triggering stalls and latency spikes in Redis that affect performance of its clients.
Approach
Currently we configure Redis to periodically make its own backups:
save 900 1
Replace that with a cronjob that first checks to see if the local Redis instance is currently in the primary role.
Only run an RDB backup if the local instance:
- is up and healthy
- is in replica mode (not primary)
- has a latest RDB backup that is at least 15 minutes old
Caveat: We do not currently make RDB backups on the redis-cache-XX instances, so this change does not need to be applied to them.
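As a rough illustration, the three checks could be sketched along these lines. This is not the actual cronjob. It assumes the job shells out to `redis-cli` with default connection settings (no auth or TLS flags), and it assumes "healthy" means the instance answers `INFO`, is not loading a dataset, and has no BGSAVE already in progress. The function names and the `MIN_BACKUP_AGE_SECONDS` constant are hypothetical. The `INFO` fields used (`role`, `loading`, `rdb_bgsave_in_progress`, `rdb_last_save_time`) are standard Redis fields.

```python
import subprocess
import time

# Assumption: mirror the 15 minute interval of the existing "save 900 1" setting.
MIN_BACKUP_AGE_SECONDS = 900


def parse_info(raw: str) -> dict:
    """Parse the "key:value" lines of Redis INFO output into a dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, "# Section" headers, and anything that is not key:value.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key] = value
    return info


def should_run_backup(info: dict, now: float,
                      min_age: int = MIN_BACKUP_AGE_SECONDS) -> bool:
    """Return True only if all three conditions for a backup hold."""
    # 1. Up and healthy (assumed definition): not loading, no BGSAVE running.
    #    Missing fields count as unhealthy, so a failed INFO call is a no-op.
    if info.get("loading") != "0" or info.get("rdb_bgsave_in_progress") != "0":
        return False
    # 2. Replica only. INFO reports the role as "master" or "slave".
    if info.get("role") != "slave":
        return False
    # 3. The latest RDB backup is at least min_age seconds old.
    last_save = int(info.get("rdb_last_save_time", "0"))
    return now - last_save >= min_age


def main() -> None:
    """Entry point a cron wrapper could call (hypothetical)."""
    try:
        raw = subprocess.run(
            ["redis-cli", "INFO"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return  # Redis is unreachable: treat as "not up" and do nothing.
    if should_run_backup(parse_info(raw), time.time()):
        subprocess.run(["redis-cli", "BGSAVE"], check=True, timeout=10)
```

Keeping the decision in a pure function (`should_run_backup`) makes it testable without a live Redis, and every failure mode falls back to skipping the backup rather than running one on a primary.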
Background
Most of our Redis instances currently have RDB backups configured to run BGSAVE automatically once at least 15 minutes have passed since the previous run finished and at least one key has changed:
save 900 1
Our redis-cache instances are an exception to this. They do not currently run RDB backups.
In #1183 (closed) we discovered that these RDB backups are causing periodic latency spikes as measured by clients using the "shared-state" Redis instance (redis-XX, a.k.a. "persistent redis").
https://log.gprd.gitlab.net/goto/294d9abc2547f5b4b6a59e9ee2abd1ed
Each of these latency spikes is at least partly explained by the 300 ms stall during fork and the subsequent extra memory access latency due to copy-on-write overhead. For more details, see the profiling and packet capture analysis results here:
#1183 (comment 645964316)
These RDB backups currently run on the primary and replica Redis instances. We don't need them to run on the primary, and that is the only place where they would cause client-facing latency.
