Discussion: General Redis scalability
This is a synthesis of previous work/suggestions/plans/designs/thoughts on making our Redis deployments much more scalable. I've linked what I know of and have found, but there may be more; please feel free to add such links if you know of them. The intention is not to entirely revisit every such discussion, but to try to take what has been said and what we know now, and as quickly as possible set out some concrete next steps to get us to a long-term scalable Redis (or equivalent) solution.
Prior work
- https://docs.google.com/document/d/1vKcT_PcX6qDHeWTA-e4we9eambKQPq6SGHnRLYN_1GM/edit#heading=h.52od5kr8peei
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12821 - Functionally partition Redis persistent
- #305 (closed) - crossslot validator
- &211 - Redis cluster compatibility prep
- &423 - Sidekiq zonal clusters
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/4841 - twemproxy for Redis-sidekiq
Options
I think these are the broad options available.
- Continue to functionally partition Redis
  - Cache:
    - RackAttack
  - Persistent:
    - Sessions
    - CI
    - Etag
    - DB load balancing (maybe)
  - Sidekiq:
    - Not as such (see sharding)
  - Tracechunks: no further partitioning, but it's got enormous growth capacity before we'll have to even think about it.
- Shard each existing cluster into multiple independent Redis instances, managed by the app.
- Proxy/bouncer layer:
  - twemproxy
  - https://github.com/CodisLabs/codis
    - Just found today, no firm opinion; the language barrier may be a challenge here.
- Redis Cluster
- KeyDB
Discussion
Functional partitioning is well understood in the sense that we've done it 3 times already (cache, sidekiq, tracechunks), and although each case is slightly different (usually in the migration plan, based on the specific data), we can manage that well enough. However, doing this still leaves us with irreducible chunks of data that must live in a single database with a single view. If at each stage we split out "significant" chunks of the workload to get sufficient benefit, the result is not much headroom over the long term. For example, say RackAttack is using 30% of the current Redis CPU and we split it out; the new Redis then has only about 3x growth before it becomes a problem again. That is a fairly large amount of headroom, but as I understand it we're looking at a much bigger picture here if possible (example: Redis-sidekiq CPU usage doubled in a year; 3x isn't much beyond that).
NB: We might very well want to functionally partition for other architectural reasons (e.g. splitting tracechunks was partly for performance, but also because of the usage pattern, monitoring possibilities, and blast radius of failures of that sub-system), but it'd be nicer to make that decision for those sorts of reasons, not because we're frantically trying to keep Redis alive and just buying a little bit of time.
Sharding existing clusters (other than Sidekiq) is, IMO, a pit of despair and self-loathing that we should only descend into if all other options are intractable. The application would need to decide which server to use for any given bit of information. For Sidekiq there is a comparatively simple approach at hand (e.g. zonal clusters), although that still has some sharp edges. Doing the same for the others, however, rapidly arrives at all the "fun" of distributed systems: consistent hashing, the CAP theorem, and so on, which we would need to implement ourselves rather than relying on existing systems and simply ensuring they are correctly configured. It's doable, and might be technically interesting, but it is non-boring in ways that have some major failure modes/risks. I think this should be our absolute last resort, perhaps for very limited cases with semantics that make it easy.
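To make concrete what "the application would need to decide which server to use" entails, here's a minimal sketch (Python/redis-py purely for illustration; the node names and URLs are invented) of app-managed consistent hashing over independent Redis instances:

```python
# Minimal sketch of client-side sharding via a consistent-hash ring.
# Illustrative only; shard names/URLs below are hypothetical.
import bisect
import hashlib

import redis  # redis-py


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Place each node at multiple points on the ring to smooth the key distribution.
        self._ring = []  # sorted list of (hash, node_name) points
        self._clients = {name: redis.Redis.from_url(url) for name, url in nodes.items()}
        for name in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{name}:{i}"), name))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def client_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._clients[self._ring[idx][1]]


# Hypothetical shard layout for a split-up cache cluster.
ring = HashRing({
    "cache-1": "redis://cache-1.internal:6379/0",
    "cache-2": "redis://cache-2.internal:6379/0",
})
ring.client_for("cache:user:42").set("cache:user:42", "value", ex=300)
```

Note that even this covers only key placement; resharding when we add a node, replication, multi-key operations, and failover are all extra work on top, and that's where the real pain lives.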
Proxying: a potentially acceptable approach, but it adds a layer, including network latency (which, however small, won't be zero) and a distinct code-base intercepting and interpreting the traffic stream. It's comparable in concept to pgbouncer, but Redis operations typically complete in well under a millisecond while SQL queries are usually measured in milliseconds, so an intermediate hop is proportionally a much bigger overhead for Redis than it is for Postgres. Note also that twemproxy "was built primarily to reduce the number of connections to the caching servers on the backend", and we get horizontal scaling more as an additional feature. And there are gotchas, like twemproxy not supporting pub/sub (which we use for CI). This is not an entirely wrong approach, but my instinct is that it's not the most effective immediate path, and we should try other things first.
Redis Cluster: considered scalable up to 1000 nodes. It has more failure modes for availability than we have now, but that's something we're likely going to have to live with no matter what. On the plus side, the performance of individual operations is generally unchanged, so we truly get N times scalability of throughput. The thing to worry about is CROSSSLOT errors (multi-key calls, including MULTI/EXEC, where the keys hash to more than one slot), but we have a validator (see #305 (closed) and gitlab-org/gitlab!32450 (merged)) and a known set of problem cases to work on.
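For illustration, a hedged sketch (redis-py's cluster client; host and keys are placeholders) of what CROSSSLOT means in practice and how hash tags keep related keys in one slot:

```python
# Illustrative only: CROSSSLOT constraint and the hash-tag workaround.
from redis.cluster import RedisCluster

rc = RedisCluster(host="redis-cluster.internal", port=6379)

# Two unrelated keys will almost certainly hash to different slots, so a single
# multi-key command (or MULTI/EXEC, or a Lua script touching both) is rejected:
# the server returns a CROSSSLOT error, and some client libraries refuse to send it at all.
# rc.mget("session:123", "session:456")

# Wrapping the shared portion in {braces} makes the cluster hash only that
# substring, forcing related keys into the same slot so multi-key ops still work.
rc.set("{user:123}:session", "abc")
rc.set("{user:123}:last_seen", "2021-01-01")
print(rc.mget("{user:123}:session", "{user:123}:last_seen"))
```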
KeyDB offers the possibility of better performance on a single multi-CPU node, with cluster mode available later when we need it. If we refrain from using any KeyDB-specific features, we don't have to worry particularly about Omnibus deployments and can just treat .com as a 'bring your own Redis deployment', just like customers using ElastiCache in AWS, Memorystore in GCP, etc. There's no guarantee that it'll be faster in our heavy use cases though; some of the benefit of KeyDB is replicated by I/O threading in core Redis, which we have already turned on. It does avoid the CROSSSLOT problem entirely (AFAIK; to be confirmed).
Proposal
Having spent not very long thinking this through, I propose a bake-off of Redis Cluster and KeyDB.
We need artificial workloads that mimic what we see in production; we have a Sidekiq test harness, but we need something similar for cache, persistent, and probably tracechunks (Lua-heavy). As with the Sidekiq test harness, we want to drive a base Redis instance until we see performance data similar to production (NB: we need replication this time; we didn't have it for the Sidekiq tests, and I think that was a contributing factor in some disparities with reality). Recording and replaying production traffic may be a quick way to get there.
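As a rough illustration of what such a harness could look like (Python/redis-py; the key names, value sizes, and read/write ratio are invented, not measurements of our traffic), something along these lines, parameterised by rate so the same driver can be pointed at plain Redis, Redis Cluster, or KeyDB:

```python
# Very rough sketch of a synthetic, cache-like workload driver. All numbers
# are placeholders; a real harness would derive them from recorded traffic.
import random
import time

import redis


def run_cache_like_load(client, ops_per_sec, duration_s, read_ratio=0.9):
    """Drive a GET-heavy mix, batched into pipelines of 100 commands."""
    deadline = time.monotonic() + duration_s
    interval = 100 / ops_per_sec  # time budget per 100-command batch
    while time.monotonic() < deadline:
        start = time.monotonic()
        pipe = client.pipeline(transaction=False)
        for _ in range(100):
            key = f"bench:cache:{random.randrange(1_000_000)}"
            if random.random() < read_ratio:
                pipe.get(key)
            else:
                pipe.setex(key, 300, "x" * 512)  # 512-byte value, 5 min TTL
        pipe.execute()
        # Sleep off any remaining budget to hold the requested rate.
        time.sleep(max(0.0, interval - (time.monotonic() - start)))


if __name__ == "__main__":
    run_cache_like_load(redis.Redis(host="localhost", port=6379),
                        ops_per_sec=20_000, duration_s=60)
```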
Next we scale up that workload and find the point at which Redis melts. This shouldn't be hard; we're a long way towards that already (85%+ CPU saturation), so a binary search between 1x and 2x current loading should find the tipping point pretty quickly. We need the 'scaling up' capability for testing the alternatives, and this is a great opportunity to find out what our scalability runway is in production.
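A sketch of how that search might be expressed, assuming a workload driver like the one above and some agreed "healthy" check (e.g. p99 latency and CPU thresholds; both are placeholders here, not agreed numbers):

```python
# Binary-search the load multiplier between 1x and 2x of production to find
# the tipping point. run_at_multiplier is assumed to drive the workload at
# m * production rate and report whether the target stayed healthy.
def find_tipping_point(run_at_multiplier, low=1.0, high=2.0, tolerance=0.05):
    while high - low > tolerance:
        mid = (low + high) / 2
        if run_at_multiplier(mid):
            low = mid   # survived: the tipping point is higher
        else:
            high = mid  # melted: the tipping point is lower
    return low
```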
Then we repeat with Redis Cluster (number of nodes TBD, but let's see how high we can go) and KeyDB (single replicating node and cluster mode). We gain two things here:
- Performance profiles/expectations of scaling capability
- Some experience with management of the two technologies. We may find some pain points here that will be a problem later; all spidey-senses should be at full alert.
I don't yet know what criteria we'd use to choose between them; hopefully some critical detail comes from the testing that makes the choice obvious. CROSSSLOT may be a key factor, and this needs teasing out in the KeyDB case; it's obviously fine in a single node, but what about cluster mode? Are we just putting this problem off until the future when the multi-threading in KeyDB isn't enough?