Redis failover should be tested as part of our QA integration test suite
GitLab the application relies heavily on the availability of Redis.
To keep Redis highly available, we rely on Redis Sentinel.
Unfortunately, as far as I know, we don't test this configuration in any meaningful way through integration tests.
Proposal
Include Redis failover as part of the staging QA test suite. This could be done by causing the Redis primary to fail in one of several ways and verifying that the staging cluster recovers with minimal downtime.
Examples of this integration test suite
Here are some examples. Other combinations could also be tried.
Test 1: Persistent Redis Crash
- Create a Rails HTTP session on staging by logging a test user in
- Wait for a predetermined period (1s?)
- "Violently" kill the Redis persistent primary with `kill -9`
- Retry accessing the staging site until the site stops returning `5xx` errors
- If this takes longer than a predetermined length of time (30s), fail the test
- If the session is lost, fail the test
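The retry step above could be sketched roughly as follows. This is only an illustration, not existing QA code: `wait_for_recovery` and its `probe` callable are hypothetical names, and `probe` is assumed to issue one request to staging with the test user's session cookie and return the HTTP status code.

```python
import time
from typing import Callable


def wait_for_recovery(
    probe: Callable[[], int],  # hypothetical: one request to staging, returns the HTTP status
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until probe() stops returning 5xx and return the observed downtime in seconds.

    Raises TimeoutError if the site is still returning 5xx after timeout_s.
    """
    start = clock()
    while True:
        status = probe()
        elapsed = clock() - start
        if not 500 <= status <= 599:
            return elapsed
        if elapsed >= timeout_s:
            raise TimeoutError(f"still returning {status} after {elapsed:.1f}s")
        sleep(interval_s)
```

Once this returns, the test would still need to check separately that the user is logged in, since a non-5xx response alone does not show that the session survived.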
Test 2: Redis Sidekiq
- Post a number of Chaos sleep jobs, enough to ensure that all workers are saturated (doing nothing)
- Post another Chaos sleep job, this time recording the job id (`jid`)
- Immediately kill the Redis sidekiq primary and allow Sentinel to failover
- Ensure that the queued job, with the recorded `jid`, is dequeued and not lost
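One possible shape for this test, as a sketch only. The three callables are hypothetical stand-ins (nothing here is an existing GitLab or Sidekiq API): `enqueue_sleep_job` posts a Chaos sleep job and returns its `jid`, `kill_primary` kills the Redis sidekiq primary, and `job_finished` reports whether the job with a given `jid` has been picked up and run.

```python
import time
from typing import Callable


def sidekiq_failover_keeps_job(
    enqueue_sleep_job: Callable[[], str],   # hypothetical: returns the new job's jid
    kill_primary: Callable[[], None],       # hypothetical: kills the Redis sidekiq primary
    job_finished: Callable[[str], bool],    # hypothetical: True once the job has run
    saturating_jobs: int,
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the tracked job is dequeued after failover, False if it never is."""
    # Saturate the workers so the tracked job has to wait in the queue.
    for _ in range(saturating_jobs):
        enqueue_sleep_job()

    # Enqueue the job we will track, then immediately kill the primary.
    jid = enqueue_sleep_job()
    kill_primary()

    # Poll until the job has run or the deadline passes.
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if job_finished(jid):
                return True
        except ConnectionError:
            pass  # lookups may fail while Sentinel is promoting a replica
        sleep(interval_s)
    return False
```

The timeout has to cover both the Sentinel failover and the time the saturating sleep jobs need to drain, so it would likely be longer than the 30s used in Test 1.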
Test 3: Redis Cache
The biggest concern with cache failover is that invalidated items can become "validated" once again if invalidation occurs at the time of failover. We should test for this with several cases.
- Using the branch cache (for example) ensure that the cache is populated for a given repository
- Invalidate the cache by pushing a new branch
- Induce Redis failover at this moment
- Ensure that the branch is included in the list of branches when queried via the API (as a means of ensuring the cache invalidation was propagated)
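To make the failure mode concrete, here is a toy in-memory model of what this test is trying to catch. It is not GitLab's cache code: the class, the `branch_names` key, and the helper functions are all made up for illustration. It relies only on the fact that Redis replication is asynchronous, so a delete acknowledged by the primary may not have reached the replica when the replica is promoted.

```python
class AsyncReplicatedCache:
    """Toy model: writes hit the primary at once and reach the replica only on replicate()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backlog = []  # commands applied on the primary but not yet on the replica

    def get(self, key):
        return self.primary.get(key)

    def set(self, key, value):
        self.primary[key] = value
        self.backlog.append(("set", key, value))

    def delete(self, key):
        self.primary.pop(key, None)
        self.backlog.append(("del", key, None))

    def replicate(self):
        for op, key, value in self.backlog:
            if op == "set":
                self.replica[key] = value
            else:
                self.replica.pop(key, None)
        self.backlog.clear()

    def failover(self):
        """Promote the replica; commands still in the backlog are lost."""
        self.primary = self.replica
        self.replica = dict(self.primary)
        self.backlog.clear()


def list_branches(cache, repo_branches):
    """Cache-aside read: serve from the cache, repopulating it on a miss."""
    cached = cache.get("branch_names")
    if cached is None:
        cached = sorted(repo_branches)
        cache.set("branch_names", cached)
    return cached


def stale_after_failover():
    repo = {"main"}
    cache = AsyncReplicatedCache()
    list_branches(cache, repo)      # populate the cache
    cache.replicate()               # the replica now holds ["main"]
    repo.add("feature")             # push a new branch
    cache.delete("branch_names")    # invalidation reaches the primary only
    cache.failover()                # replica promoted before the delete replicates
    return list_branches(cache, repo)
```

In this model `stale_after_failover()` returns a list without `feature`: the invalidated entry has effectively come back. The integration test would fail if the real API showed the same behaviour.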
cc @meks for QA visibility