Upgrade omnibus + Redis on sentinel-based deployments to enable increasing maxclients

Originally discussed in #2754 (closed).

Background

We have a latent risk of not being able to increase maxclients on our sentinel-based Redis deployments. This is because:

To increase maxclients, we must also raise the open files ulimit, which acts as an upper bound. This ulimit is hardcoded to 50k in omnibus.
Omnibus has received a patch that lets us override this ulimit through Chef. However, we have pinned an older version of omnibus (15.11) to avoid accidental Redis upgrades. To get the override option, we must upgrade omnibus.
The omnibus upgrade also upgrades Redis. The two changes are coupled, so getting the override option means performing a major Redis upgrade from 6.2 to (at least) 7.0.
Unfortunately, this Redis version bump increases the RDB version, and an older Redis cannot load an RDB file written in a newer format, so we are dealing with a one-way upgrade. After the first promotion, we cannot easily go back. See: gitlab-com/gl-infra/sre-observability/redis-upgrade-harness!6 (comment 1747940812).
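
To make the ulimit constraint concrete, here is a minimal sketch of how the open files limit caps the usable client limit. It assumes the documented Redis behaviour of reserving 32 file descriptors for internal use, and it takes "50k" to mean 50,000 purely for illustration. The function names are hypothetical and not part of omnibus or Redis.

```python
# Redis keeps a few file descriptors for itself (listening sockets,
# persistence files, replication links), so not every descriptor allowed
# by the ulimit is available for client connections.
REDIS_RESERVED_FDS = 32  # reserved descriptor count per the Redis docs


def effective_maxclients(requested_maxclients: int, nofile_limit: int) -> int:
    """Client limit Redis can honor under a given open files limit."""
    ceiling = nofile_limit - REDIS_RESERVED_FDS
    return max(0, min(requested_maxclients, ceiling))


def required_nofile(target_maxclients: int) -> int:
    """Smallest open files limit that allows the target client limit."""
    return target_maxclients + REDIS_RESERVED_FDS


# Illustrative numbers only: asking for 60,000 clients under a 50,000 ulimit.
capped = effective_maxclients(60_000, 50_000)
needed = required_nofile(60_000)
```

In other words, raising the client limit in the Redis config alone has no effect past the ceiling; the ulimit has to move first.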

Headroom

We have a bit of maxclients headroom left, but we are already reaching 75% of the limit at times:

source

The deployments that are most at-risk are redis-repository-cache and redis-sidekiq.

Note that these numbers represent the best case. Since we sample them only every 15 seconds, there may be shorter bursts beyond what we are seeing here.
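
The sampling caveat can be illustrated with a small sketch. The per-second connection counts below are synthetic, made up only to show the effect, and the helper name is hypothetical.

```python
def sampled_peak(per_second_counts, interval=15):
    """Peak as seen by a scraper that reads one value every `interval` seconds."""
    return max(per_second_counts[::interval])


# One minute of synthetic data: steady 7,000 connections with a 5-second
# burst to 9,500 that falls between the samples taken at t=15 and t=30.
series = [7_000] * 60
series[20:25] = [9_500] * 5

true_peak = max(series)              # what actually happened
observed_peak = sampled_peak(series)  # what a 15-second scrape would record
```

Here the scraper records a peak of 7,000 while the real peak was 9,500, which is why the chart should be read as a lower bound on utilization.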

Impact

If we reach this limit, Redis clients will be unable to connect. This will likely result in an increased error rate and could become a major outage.

Proposal

We should upgrade the omnibus package on sentinel-based deployments so that we can safely increase ulimit + maxclients. Because of the one-way Redis upgrade, this requires some planning and coordination. We should test the procedure extensively on pre-prod. We should also design a break-glass downgrade procedure that may result in data loss.

Alternatives

There are a few alternative approaches that we can consider.

Migrate workload away: By migrating workload away from at-risk deployments, we can buy some time or side-step the issue completely. This aligns with some already planned initiatives including Migrating redis-repository-cache to a Redis Cluster and Horizontally scale Sidekiq.
Bypass omnibus gitlab-runsvdir.service systemd unit: The only thing preventing us from raising the ulimit is the value hardcoded in the omnibus start script. If we disable gitlab-runsvdir.service and instead create our own Redis unit, we can define LimitNOFILE as high as we want.
Bypass omnibus entirely: Since omnibus limitations have created many headaches in the past, moving our sentinel-based Redis deployments away from it would address this issue and ease maintenance in the future. This could be a migration to Kubernetes or the introduction of a new cookbook. This is also being discussed here: #2818 (comment 1776482093).
Manually edit configuration: I consider this a last-ditch break-glass operation only. We can disable Chef, add an increased ulimit to the service definition, and set an increased maxclients value in the Redis config. This is unsustainable and highly risky.

Edited Feb 19, 2024 by Igor