# RCA: Redis secondaries losing connection with master
First, please check the incident issues: production#675 (closed), production#682 (closed), and production#686 (closed).
## RCA
The initial reason why both secondaries exceeded the buffer limit is still unknown, but we have a theory as to why they don't recover afterwards.
From the master (redis-02) logs:
```
2019-02-13_10:53:01.23712 25363:M 13 Feb 10:53:01.236 * Synchronization with slave 10.217.2.101:6379 succeeded
2019-02-13_10:53:11.03960 25363:M 13 Feb 10:53:11.039 * Synchronization with slave 10.217.2.103:6379 succeeded
2019-02-13_10:53:25.99430 25363:M 13 Feb 10:53:25.994 # Client id=63274348 addr=10.217.2.103:36081 fd=4635 name= age=97 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=62960 omem=2147483656 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
2019-02-13_10:53:26.00353 25363:M 13 Feb 10:53:26.003 # Client id=63274329 addr=10.217.2.101:44653 fd=4540 name= age=99 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=62959 omem=2147552064 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
```
And from the logs of one of the secondaries (redis-03):
```
2019-02-13_10:51:51.95015 3670:S 13 Feb 10:51:51.950 * Full resync from master: 76eebdf64f90c1d873dcda25c16058f4e581b301:35311495700268
2019-02-13_10:51:51.95021 3670:S 13 Feb 10:51:51.950 * Discarding previously cached master state.
2019-02-13_10:52:43.52539 3670:S 13 Feb 10:52:43.525 * MASTER <-> SLAVE sync: receiving 1770264497 bytes from master
2019-02-13_10:53:12.30696 3670:S 13 Feb 10:53:12.306 * MASTER <-> SLAVE sync: Flushing old data
2019-02-13_10:53:46.24629 3670:S 13 Feb 10:53:46.246 * MASTER <-> SLAVE sync: Loading DB in memory
2019-02-13_10:54:14.54642 3670:S 13 Feb 10:54:14.546 * MASTER <-> SLAVE sync: Finished with success
2019-02-13_10:54:14.54648 3670:S 13 Feb 10:54:14.546 * 1 changes in 900 seconds. Saving...
2019-02-13_10:54:14.65519 3670:S 13 Feb 10:54:14.655 * Background saving started by pid 1824
2019-02-13_10:54:14.75329 3670:S 13 Feb 10:54:14.753 # Connection with master lost.
```
At 10:53:11.039 we can see that the master marked redis-03 as synchronized and ready to accept the ongoing replication stream, but on redis-03, at almost the same time (a 1-second difference), the sync is still at `MASTER <-> SLAVE sync: Flushing old data`. It doesn't "finish with success" until 62 seconds later, after which it loses the connection with the master and starts the whole loop once more.
During these 62 seconds, redis-03 probably can't consume any of the replication stream from the master (since Redis is single-threaded), so the master buffers it until the output buffer reaches its limit (2 GB) and closes redis-03's connection.
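The 2 GB limit lines up with the `omem` values in the master log (≈2147483656 bytes). As a sketch, the relevant setting in `redis.conf` would look something like this (the soft limit and window below are illustrative assumptions, not necessarily our exact values):

```
# Format: client-output-buffer-limit <class> <hard-limit> <soft-limit> <soft-seconds>
# A 2gb hard limit matches the omem=~2147483656 at which the master
# scheduled the replica connections to be closed ASAP.
client-output-buffer-limit slave 2gb 2gb 60
```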
## Corrective Actions
### Bump `client-output-buffer-limit` (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6229)
Set the hard limit to 4 GB, as having identical hard and soft limits defeats the purpose of the latter.
Note that `chef-client` doesn't run `gitlab-ctl reconfigure` on Redis nodes, and we can't run it ourselves because it would restart the service upon changing `redis.conf`. So updating the file manually and running `config set <new config>` in a Redis console is our best approach. Experiment in staging first!
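A sketch of that manual procedure (`CONFIG SET`/`CONFIG GET` are real Redis commands that take the whole limit string as one value; the 4 GB hard limit follows the proposal above, while the soft limit and window are assumptions to be settled in staging):

```shell
# 1. Apply the new limit to the running instance without a restart.
redis-cli CONFIG SET client-output-buffer-limit "slave 4gb 2gb 60"

# 2. Verify it took effect.
redis-cli CONFIG GET client-output-buffer-limit

# 3. Edit redis.conf by hand to match, so the change survives the next
#    restart, since chef-client won't run gitlab-ctl reconfigure here.
```

Note that `CONFIG SET` changes only the running instance; without the matching `redis.conf` edit, a restart silently reverts to the old limit.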
### Fix `RedisFailover` alert (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6230)
During the last two incidents, we got alerts about a Redis failover. These were false positives, as no failover happened. The alert definition uses `redis_connected_slaves`, which changed during this incident, triggering the alert needlessly. Also, its title can cause confusion, as it lists both types of Redis (`redis` and `redis-cache`), which can send us investigating in the wrong direction.
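As a hedged sketch of the direction the fix could take, a rule like the following would fire per fleet and only on a sustained drop, rather than on the brief `redis_connected_slaves` dip a replica resync causes (the alert name, labels, thresholds, and `for` window here are illustrative assumptions, not our actual rule):

```yaml
# Hypothetical Prometheus alerting rule (labels/thresholds assumed).
- alert: RedisReplicasDisconnected
  expr: redis_connected_slaves{type="redis"} < 2
  for: 5m
  labels:
    severity: warn
  annotations:
    title: "Redis ({{ $labels.type }}) has fewer than 2 connected replicas"
```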
