Increase replication buffer size for redis-cache
Summary
When a Redis replica needs to be resynced (e.g. after a primary failover or when reinitializing a replica), the Redis primary must create a snapshot of its data, send it to the replica(s), wait for the replica(s) to load it, and then let the replicas consume and replay the backlog of writes that accumulated since the snapshot. The buffer that accumulates this backlog has a configurable maximum size (historically 4 GB).
If that buffer fills up before the replica can start consuming it, the SYNC attempt fails and must be restarted from scratch. In that scenario, the primary closes its connection to the replica and discards the overrun buffer. The replica may not notice this until it has finished loading the RDB file it received, but by then that work is useless and will be discarded during the next sync attempt.
This sequence of failed resync attempts repeated several times during incident production#7495 (closed). We estimate that the buffer was saturating shortly before the replica would have started consuming it. Increasing the buffer's maximum size by 50% should give comfortable headroom, so that the first resync attempt is much more likely to succeed even if the rate of bytes written to Redis is moderately higher than usual.
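To make the headroom argument concrete, here is a minimal sketch of the arithmetic. It assumes a steady write rate and binary units (as with Redis' `gb` suffix). The 8 MiB/s rate is a hypothetical placeholder, not a value measured during the incident, and `seconds_until_full` is an illustrative helper name.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def seconds_until_full(limit_bytes: int, write_rate_bytes_per_s: float) -> float:
    """Time until the replica output buffer reaches its limit at a steady write rate."""
    if write_rate_bytes_per_s <= 0:
        raise ValueError("write rate must be positive")
    return limit_bytes / write_rate_bytes_per_s


# ASSUMPTION: hypothetical steady replication-stream rate, not an incident measurement.
ASSUMED_RATE = 8 * MIB

old_window = seconds_until_full(4 * GIB, ASSUMED_RATE)
new_window = seconds_until_full(6 * GIB, ASSUMED_RATE)

print(f"4 GB limit fills after {old_window:.0f} s")
print(f"6 GB limit fills after {new_window:.0f} s")
print(f"window grows by {new_window / old_window - 1:.0%}")
```

Whatever the real write rate is, the time available for the replica to receive and load the snapshot grows by the same 50% as the buffer limit. Equivalently, for a fixed load time the tolerable write rate grows by 50%.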
Related Incident(s)
Originating issue(s): production#7495 (closed)
Desired Outcome/Acceptance Criteria
Increase client_output_buffer_limit_replica from 4 GB to 6 GB.
Note that this change has already been applied to the runtime configuration in production (but not in staging) as an incident mitigation. Here we aim to persist that tuning adjustment by adding it to Chef and redis.conf.
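Because the runtime and persisted settings can drift apart, it may help to check the effective value on each node. The sketch below parses a value string of the kind returned by `CONFIG GET client-output-buffer-limit` and extracts the replica hard limit. The sample string is illustrative only: its soft limits and the other client classes are assumptions, not the production values. The helper name `replica_hard_limit` is hypothetical, and the class label is accepted as either `replica` or `slave` because it can differ between Redis versions.

```python
GIB = 1024 ** 3


def replica_hard_limit(config_value: str) -> int:
    """Return the replica hard limit in bytes from a client-output-buffer-limit value."""
    tokens = config_value.split()
    # The value is a flat sequence of (class, hard, soft, soft_seconds) groups.
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft soft_seconds")
    for i in range(0, len(tokens), 4):
        client_class, hard = tokens[i], tokens[i + 1]
        if client_class in ("replica", "slave"):
            return int(hard)
    raise KeyError("no replica class found in value")


# ASSUMPTION: illustrative value string, not copied from a production node.
sample = "normal 0 0 0 slave 6442450944 6442450944 0 pubsub 33554432 8388608 60"

limit = replica_hard_limit(sample)
print(f"replica hard limit: {limit / GIB:.0f} GiB")
```

Running the same check against staging and production would show whether both match the intended 6 GB once the persisted configuration is rolled out.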
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')