Scaling actioncable workload
Originates from https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1767#note_1874677381. Due to the increase in actioncable load, redis-pubsub
primary CPU utilisation has gone up considerably and is forecasted to exceed the soft thresholds in June 2024.
Short-term
The short-term mitigation is to separate the actioncable workload from redis-pubsub. This Redis instance contains 2 workloads: workhorse and actioncable. They were migrated as part of &1066 (closed) to move workloads incompatible with Redis-Cluster onto a standalone Redis (with sentinel). At the time, we moved both workloads into 1 instance since they are similar in nature (pub/sub) and were relatively small.
As we have performed a similar migration in production#16436 (closed), the complexity of this task is relatively low. The work required:
- Provision a new Redis using the gitlab-redis cookbook
- Change the application to use MultiStore in the actioncable initializer
- Perform the migration using a change management issue
Note that there are other concurrent efforts in gitlab-org/gitlab#457683 to reduce the workload but we are unlikely to have sufficient time before the forecasted saturation happens.
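The general idea behind a dual-store migration for pub/sub can be sketched as follows. This is a minimal, language-agnostic illustration in Python and not GitLab's MultiStore implementation: `InMemoryBroker`, `DualStore`, the `dual_write` and `new_is_default` flags, and the instance name `new-actioncable-redis` are all hypothetical names made up for this sketch. It assumes messages are published to both instances while subscribers are gradually moved from the old instance to the new one.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class InMemoryBroker:
    """Stand-in for one Redis pub/sub instance (hypothetical, not a Redis client)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._handlers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        # Deliver to every subscriber of the channel; return how many received it.
        handlers = self._handlers.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


class DualStore:
    """Routes pub/sub calls to an old and a new broker during a migration.

    Both flags are assumptions of this sketch:
    - dual_write: publish to both brokers so no subscriber misses a message.
    - new_is_default: new subscriptions go to the new broker.
    """

    def __init__(self, old: InMemoryBroker, new: InMemoryBroker,
                 dual_write: bool = False, new_is_default: bool = False) -> None:
        self.old = old
        self.new = new
        self.dual_write = dual_write
        self.new_is_default = new_is_default

    def _default(self) -> InMemoryBroker:
        return self.new if self.new_is_default else self.old

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._default().subscribe(channel, handler)

    def publish(self, channel: str, message: str) -> int:
        if self.dual_write:
            # Subscribers may still be attached to either broker.
            return (self.old.publish(channel, message)
                    + self.new.publish(channel, message))
        return self._default().publish(channel, message)


# Usage: enable dual writes first, then flip the default for subscribers.
old = InMemoryBroker("redis-pubsub")
new = InMemoryBroker("new-actioncable-redis")  # hypothetical instance name
store = DualStore(old, new, dual_write=True)

early, late = [], []
store.subscribe("notes:1", early.append)   # lands on the old broker
store.new_is_default = True
store.subscribe("notes:1", late.append)    # lands on the new broker
store.publish("notes:1", "updated")        # reaches both subscribers
```

Once every subscriber has reconnected against the new instance, dual writes could be switched off and the workload would no longer touch redis-pubsub. The order of flag changes shown here is an assumption; the actual rollout would follow the change management issue.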
Long-term
We can explore horizontally scaling this workload using Redis Cluster and sharded Pub/Sub. This will require modifications to the ActionCable adapter, since it does not work with a Redis Cluster. That can be done either by patching the adapter locally or by upstreaming a Redis Cluster adapter to Rails. Patching is the likelier approach, since upstreaming would require us to bump our Rails version (no trivial feat).
We would need to consider the delivery timeline of cells and the state of the Redis instance post-migration to evaluate whether there is a strong need to invest effort in this.
There is a draft MR which uses a Redis Cluster actioncable adapter. There is also the problem of live migration, since the commands are different now (e.g. publish vs spublish).
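One reason publish and spublish are not interchangeable is that sharded Pub/Sub assigns each channel to a hash slot using the same algorithm Redis Cluster uses for keys (CRC16 of the name, modulo 16384, honouring `{hash tags}`), so a message goes only to the shard that owns that slot. The sketch below computes that slot in plain Python. It is an illustration only and not code from the draft MR; the channel names in the usage loop are hypothetical.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 variant used by Redis Cluster (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(name: str) -> int:
    """Return the Redis Cluster hash slot (0..16383) for a key or shard channel."""
    raw = name.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty {tag} restricts what gets hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Usage: hypothetical channel names; channels sharing a {tag} share a slot.
for channel in ("notes:1", "notes:2", "{project:42}:notes", "{project:42}:issues"):
    print(channel, key_slot(channel))
```

Because subscribers must connect to the shard that owns a channel's slot, a live migration would need publishers and subscribers to agree on which command family is in use for a given channel at any point in time.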
Other useful links of past discussions