Redis failover should be tested as part of our QA integration test suite

GitLab, the application, relies heavily on the availability of Redis.

To keep Redis highly available, we rely on Redis Sentinel.

Unfortunately, as far as I know, we don't test this configuration in any meaningful way through integration tests.

Proposal

Include Redis failover as part of the staging QA test suite. This could be done by causing the Redis primary to fail in one of several ways, and then verifying that downtime on the staging cluster stays within an agreed limit.

Example tests for this integration test suite

Here are some examples. Other combinations could also be tried.

Test 1: Persistent Redis Crash

Create a Rails HTTP session on staging by logging a test user in
Wait for a predetermined period (1s?)
"Violently" kill the Redis persistent primary with kill -9
Retry accessing the staging site until the site stops returning 5xx errors
If this takes longer than a predetermined length of time (30s), fail the test
If the session is lost, fail the test
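
The retry-until-recovered step above could be sketched roughly as follows. This is only an illustration in Python, not the QA framework's actual API: `wait_for_recovery` and its `probe` callable are hypothetical names, and `probe` is assumed to issue one request to staging with the test user's session cookie and return the HTTP status code.

```python
import time
from typing import Callable


def wait_for_recovery(
    probe: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until probe() returns a non-5xx status.

    Returns the number of seconds spent waiting. Raises TimeoutError if the
    site is still returning 5xx once timeout_s has elapsed.

    probe is a hypothetical callable supplied by the test: it performs one
    request against staging and returns the HTTP status code.
    """
    start = clock()
    while True:
        status = probe()
        elapsed = clock() - start
        if status < 500:
            return elapsed
        if elapsed >= timeout_s:
            raise TimeoutError(
                f"still returning {status} after {elapsed:.1f}s"
            )
        sleep(interval_s)
```

The clock and sleep functions are injectable so the helper itself can be exercised without a real outage. The session check would be a separate assertion after recovery, for example that a page requiring login still responds as it did for the logged-in user.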

Test 2: Redis Sidekiq

Post a number of Chaos sleep jobs, enough to ensure that all workers are saturated (busy sleeping, doing no real work)
Post another Chaos sleep job, this time recording the job id (jid)
Immediately kill the Redis Sidekiq primary and allow Sentinel to fail over
Ensure that the queued job with the recorded jid is dequeued and not lost
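
The final check could look something like the sketch below. It assumes that each Sidekiq job payload is a JSON object carrying a "jid" field, and that the test can obtain (a) the raw payloads still sitting in the queue after failover and (b) the set of jids the workers have reported as processed. How those two inputs are collected is left open; `queued_jids` and `job_state` are hypothetical helper names.

```python
import json
from typing import Iterable, Set, Union


def queued_jids(raw_payloads: Iterable[Union[str, bytes]]) -> Set[str]:
    """Extract the jid of every well-formed job payload in the queue."""
    jids: Set[str] = set()
    for raw in raw_payloads:
        try:
            job = json.loads(raw)
        except ValueError:
            # Skip anything that is not valid JSON rather than failing here.
            continue
        if isinstance(job, dict) and "jid" in job:
            jids.add(job["jid"])
    return jids


def job_state(
    jid: str,
    raw_payloads: Iterable[Union[str, bytes]],
    processed_jids: Set[str],
) -> str:
    """Classify the recorded job as 'processed', 'queued', or 'lost'."""
    if jid in processed_jids:
        return "processed"
    if jid in queued_jids(raw_payloads):
        return "queued"
    return "lost"
```

The test would fail on "lost" immediately, and would keep polling on "queued" until the job is reported as processed or a deadline passes.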

Test 3: Redis Cache

The biggest concern with cache failover is that invalidated items can become "validated" once again if invalidation occurs at the time of failover. We should test for this with several cases.

Using the branch cache (for example), ensure that the cache is populated for a given repository
Invalidate the cache by pushing a new branch
Induce Redis failover at this moment
Ensure that the branch is included in the list of branches when queried via the API (as a means of ensuring the cache invalidation was propagated)
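
The last step could be expressed along these lines. The sketch assumes the branch list is fetched from the API as a JSON array of objects that each have a "name" field; fetching the response body is left to the test, and `branch_names` and `invalidation_propagated` are hypothetical helper names.

```python
import json
from typing import List


def branch_names(api_body: str) -> List[str]:
    """Return the branch names from a JSON array of branch objects."""
    return [item["name"] for item in json.loads(api_body)]


def invalidation_propagated(pushed_branch: str, api_body: str) -> bool:
    """True if the newly pushed branch appears in the API response.

    A False result after failover suggests the stale, pre-push cache entry
    became visible again, i.e. the invalidation was lost.
    """
    return pushed_branch in branch_names(api_body)
```

If the branch list is paginated, a real test would need to walk all pages or query for the branch by name before concluding that it is missing.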

cc @meks for QA visibility