Upgrade omnibus + Redis on sentinel-based deployments to enable increasing maxclients
Originally discussed in #2754.
Background
We have a latent risk of not being able to increase maxclients on our sentinel-based Redis deployments. This is because:
- In order to increase maxclients, we must also bump the open files ulimit, which acts as an upper bound. This ulimit is hardcoded to 50k in omnibus.
- Omnibus has received a patch that enables us to override this ulimit through chef. However, we have pinned an older version of omnibus (15.11) in order to avoid accidental Redis upgrades. In order to get this override option, we must upgrade omnibus.
- As part of the omnibus upgrade, we also upgrade Redis. These two changes are coupled. So in order to have this option, we must perform a major Redis upgrade from 6.2 to (at least) 7.0.
- Unfortunately this Redis version bump increases the RDB version, which means we are dealing with a one-way upgrade. After the first promotion, we cannot easily go back. See: gitlab-com/gl-infra/sre-observability/redis-upgrade-harness!6 (comment 1747940812).
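To make the "upper bound" relationship concrete, here is a minimal sketch of how the open files limit caps the client limit. It relies on the documented Redis behaviour of reserving 32 file descriptors for internal use and lowering the client limit when the OS limit is too small. The function name is hypothetical, "50k" is assumed to mean 50,000, and the real startup logic has additional edge cases that are not modelled here.

```python
# Redis reserves this many file descriptors for internal use
# (listening sockets, persistence files, replication links, ...).
RESERVED_FDS = 32


def effective_maxclients(requested_maxclients: int, nofile_limit: int) -> int:
    """Return the client limit Redis can honour under a given open files limit.

    Simplified model: if the limit cannot cover requested_maxclients plus the
    reserved descriptors, the client limit is lowered to fit.
    """
    return min(requested_maxclients, nofile_limit - RESERVED_FDS)


# Assumption: the hardcoded omnibus "50k" ulimit is taken as 50_000.
OMNIBUS_NOFILE = 50_000

# A request below the ceiling is honoured as-is.
print(effective_maxclients(40_000, OMNIBUS_NOFILE))
# A request above the ceiling is silently capped by the ulimit.
print(effective_maxclients(60_000, OMNIBUS_NOFILE))
```

In other words, raising the client limit in the Redis config alone has no effect beyond the ceiling set by the ulimit, which is why the override option matters.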
Headroom
We have some maxclients headroom left, but we are already reaching 75% of the limit at times.
The deployments that are most at-risk are redis-repository-cache and redis-sidekiq.
Note that these numbers represent the best case. Since we sample them only every 15 seconds, there may be shorter bursts beyond what we are seeing here.
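The effect of the 15-second sampling interval can be illustrated with a small sketch. All numbers below are made up for illustration (they are not measurements from our deployments), and the helper names are hypothetical. The point is only that a burst shorter than the sampling interval can fall entirely between two samples.

```python
def utilization(connected_clients: int, maxclients: int) -> float:
    """Fraction of the client limit currently in use."""
    return connected_clients / maxclients


def sampled_peak(per_second: list[int], interval_s: int) -> int:
    """Peak as seen by a scraper that reads one value every interval_s seconds."""
    return max(per_second[::interval_s])


# Hypothetical limit and a hypothetical one-minute series of per-second
# connection counts with a steady baseline.
MAXCLIENTS = 10_000
series = [7_000] * 60

# A 5-second burst that starts and ends between the samples taken
# at t=15 and t=30.
for t in range(20, 25):
    series[t] = 9_500

seen_peak = sampled_peak(series, 15)
true_peak = max(series)

print(f"sampled peak: {utilization(seen_peak, MAXCLIENTS):.0%}")
print(f"true peak:    {utilization(true_peak, MAXCLIENTS):.0%}")
```

In this constructed case the sampled series never shows the burst at all, so the graphs would understate peak utilization.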
Impact
If we reach this limit, Redis clients will be unable to connect. This will likely result in an increased error rate and could become a major outage.
Proposal
We should upgrade the omnibus package on sentinel-based deployments so that we can safely increase ulimit + maxclients. Because of the one-way Redis upgrade, this requires some planning and coordination. We should test the procedure extensively on pre-prod. We should also design a break-glass downgrade procedure that may result in data loss.
Alternatives
There are a few alternative approaches that we can consider.
- Migrate workload away: By migrating workload away from at-risk deployments, we can buy some time or side-step the issue completely. This aligns with some already planned initiatives including Migrating redis-repository-cache to a Redis Cluster and Horizontally scale Sidekiq.
- Bypass omnibus gitlab-runsvdir.service systemd unit: The only reason we cannot raise the ulimit is the hardcoded one in the omnibus start script. If we disable gitlab-runsvdir.service and instead create our own Redis unit, we can define LimitNOFILE as high as we want.
- Bypass omnibus entirely: Since omnibus limitations have created many headaches in the past, moving our sentinel-based Redis deployments away from it would address this issue, plus ease maintenance in the future. This could be a migration to Kubernetes or the introduction of a new cookbook. This is also being discussed here: #2818 (comment 1776482093).
- Manually edit configuration: I consider this a last-ditch break-glass operation only. We can disable chef, add an increased ulimit to the service definition, and set an increased maxclients value in the Redis config. This is unsustainable and highly risky.
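Whichever option ends up raising the limit, it is worth confirming that the running Redis process actually received it. One way is to read the "Max open files" row of /proc/<pid>/limits on Linux. The following is a minimal sketch: the function names are hypothetical, the sample text is illustrative rather than copied from a real host, and it assumes numeric limits (an "unlimited" value is not handled).

```python
from pathlib import Path


def parse_max_open_files(limits_text: str) -> tuple[int, int]:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files", soft limit, hard limit, units.
            fields = line.split()
            return int(fields[3]), int(fields[4])
    raise ValueError("no 'Max open files' line found")


def read_limits(pid: int) -> str:
    """Read the limits file of a running process (Linux only)."""
    return Path(f"/proc/{pid}/limits").read_text()


# Illustrative sample in the layout of /proc/<pid>/limits.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            50000                50000                files\n"
)

soft, hard = parse_max_open_files(sample)
print(f"soft={soft} hard={hard}")
```

On a real host, `parse_max_open_files(read_limits(pid))` with the Redis server PID would show whether the new limit is in effect before maxclients is raised.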