Make sure Redis client library reconnects after Redis failures (recovery)

Make sure the Redis client library we use (see package.json) reconnects after Redis failures (recovery).

The webapp/ws servers need to reconnect to the Redis instance after Redis sentinel reports that something is down and votes in a new leader/master. In the past couple of Gitter outages, after the Redis sentinels sent out a bunch of alerts, we had to restart the webapp/ws servers before things went back to normal.

What libraries are we using to connect to Redis?

graph TD
A[webapp] --> B["@gitterhq/env"]
B --> C["@gitterhq/redis-sentinel-client"]
B --> D[redis]
B --> E[ioredis]

redis@0.10.3
ioredis@1.15.1
@gitterhq/redis-sentinel-client@0.4.0
- git repo
- this seems like the most likely culprit for the sentinel failover incident

Follow-up to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8821#note_266751150

Redis client timeouts (or lack of them)

The Redis client does not time out commands. When the Redis server keeps the connection open but doesn't send a response (for example, if you put the server to sleep with DEBUG SLEEP <seconds>, or, I assume, when Redis is slow to respond), the client just waits. This can result in the application hanging completely. I used Wireshark to inspect the underlying TCP connection: the server keeps sending ACK packets to confirm packets received from the client even while it is asleep. AFAIK this prevents any OS-level TCP timeout handling from kicking in. I'm not certain the same would happen in a production hang scenario.

This is a property of both the ioredis and redis npm modules. The most common way to handle the problem is to race each command against a timeout promise, as illustrated in the example project redis-timeout-showcase, which was created to demonstrate the issue. This approach has been recommended in ioredis issues #139 and #61.
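
To make the racing-timeout idea concrete, here is a minimal sketch. The helper name `withTimeout` and the error message are my own, not taken from redis-timeout-showcase or either library, and it assumes the client returns a promise per command.

```typescript
// Hypothetical helper (not part of ioredis or redis): rejects if `command`
// has not settled within `ms` milliseconds.
function withTimeout<T>(command: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // This promise never resolves; it only rejects once the deadline passes.
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Redis command timed out after ${ms} ms`)),
      ms
    );
  });

  // Clear the timer as soon as the command settles so it cannot fire later
  // or keep the process alive.
  const clear = () => clearTimeout(timer);
  command.then(clear, clear);

  // Whichever promise settles first decides the outcome.
  return Promise.race([command, timeout]);
}
```

With a promise-returning client this would be used roughly as `withTimeout(client.get("some-key"), 1000)`. Note that the timeout only unblocks the caller. It does not close the underlying connection, so on its own it would not make the client ask sentinel for a new master.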

The issue for us is that if the connection isn't being closed, we never ask sentinel for a new, healthy master.

Another suggested option is to send a PING command periodically for each Redis client and close the client connection if the response doesn't come. I really like this because it keeps the timeout logic in one place, as opposed to having to implement a racing timeout promise in every Redis command call.
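
A rough sketch of what such a PING watchdog could look like follows. The `PingableClient` interface, its `closeConnection` method, and `startPingWatchdog` are hypothetical names for illustration; how they map onto ioredis or redis, and whether the library reconnects through sentinel after the connection is closed, would still need to be verified.

```typescript
// Hypothetical minimal client shape; real libraries name these differently.
interface PingableClient {
  ping(): Promise<unknown>;
  closeConnection(): void;
}

// Sends PING every `intervalMs`; if a PING gets no reply within `timeoutMs`,
// closes the connection. Returns a function that stops the watchdog.
function startPingWatchdog(
  client: PingableClient,
  intervalMs: number,
  timeoutMs: number
): () => void {
  let pending: ReturnType<typeof setTimeout> | undefined;

  const interval = setInterval(() => {
    if (pending !== undefined) return; // previous PING is still unanswered

    const timer = setTimeout(() => {
      // No reply in time: treat the connection as hung and close it.
      if (pending === timer) pending = undefined;
      client.closeConnection();
    }, timeoutMs);
    pending = timer;

    // Any reply (or error) means the connection is not silently hanging.
    const settle = () => {
      clearTimeout(timer);
      if (pending === timer) pending = undefined;
    };
    client.ping().then(settle, settle);
  }, intervalMs);

  return () => {
    clearInterval(interval);
    clearTimeout(pending);
    pending = undefined;
  };
}
```

The appeal is the one described above: the timeout lives in a single place per client, and individual command calls stay untouched.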

Edited Mar 31, 2020 by Tomas Vik