Research not using a single redis instance for everything
Background:
Please note that for 9.3 this is a research task only. The outcome should be documentation or comments in this issue on what needs to be done and its impact. In particular, upgrading needs to be considered.
Whenever we have a Redis leader election on GitLab.com it introduces 10 minutes of downtime, because that's the time it takes to flush all the data to disk and across nodes. Even though it's self-healing, the TTR is terrible, and it could easily be improved by splitting our huge Redis instance into multiple instances that each take care of one specific job.
For example:
- one for sidekiq.
- one for session storage.
- one for long polling.
- one for whatever else that is just caching.
I may be missing some of what else we use Redis for, but in general I think we need a split between the kind of Redis we can't lose data from (Sidekiq) and everything else, which can vanish at any time.
We could then set them up in such a way that we control the lifecycle of each instance. That would reduce the downtime caused by having to recover gigabytes of data from disk before the new master is up and running, and it would let us scale the instances better by splitting them according to their scaling needs.
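To make the upgrade question more concrete, here is a minimal sketch of how the application could pick a Redis instance per workload while still falling back to the single shared instance when no dedicated one is configured. It is only an illustration, not how GitLab does it today: the workload names, the `REDIS_<WORKLOAD>_URL` environment variables, and the `redis_url_for` helper are all hypothetical.

```python
import os

# Hypothetical workload names, taken from the list in this issue.
# "persistent" marks the data we cannot afford to lose (Sidekiq queues).
WORKLOADS = {
    "queues": {"persistent": True},     # sidekiq
    "sessions": {"persistent": False},  # session storage
    "polling": {"persistent": False},   # long polling
    "cache": {"persistent": False},     # everything else that is just caching
}

# Assumed default, used only when nothing at all is configured.
DEFAULT_URL = "redis://localhost:6379"


def redis_url_for(workload, env=None):
    """Return the Redis URL for a workload.

    A dedicated REDIS_<WORKLOAD>_URL wins. Otherwise fall back to the
    shared REDIS_URL, so an installation that still runs a single
    instance keeps working unchanged after an upgrade.
    """
    env = os.environ if env is None else env
    if workload not in WORKLOADS:
        raise KeyError(f"unknown Redis workload: {workload}")
    dedicated = env.get(f"REDIS_{workload.upper()}_URL")
    shared = env.get("REDIS_URL", DEFAULT_URL)
    return dedicated or shared


def needs_persistence(workload):
    """Tell operators whether this instance must be configured to persist data."""
    return WORKLOADS[workload]["persistent"]
```

With a fallback like this, existing installations would not need to change anything, and GitLab.com could move one workload at a time to its own instance, starting with the ones that do not need persistence.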