# Only enable workhorse keywatcher on API

## TLDR
Workhorse has a "keywatcher" component that subscribes to a redis key in order to get notifications on certain changes. It uses this to serve long-polling requests.
This mechanism is putting a lot of load onto redis (5% of user cycles on the redis-persistent primary).
We discovered that all of our workhorse processes are subscribed to the redis key, and thus we are delivering messages to many workhorses that may not need them.
The only workhorse fleet that needs these messages is `api`. The proposal is to remove the keywatcher from all other workhorse fleets: git, web, and websockets.
## Motivation
We are approaching the limit of our redis instances. They are mostly single-threaded and not easily horizontally scalable. Other efforts are underway to address the scalability aspect.
Mitigations include enabling threaded I/O (a new feature in redis 6), which can buy us a bit of headroom.
Another mitigation is to reduce load on redis. For this, we performed some performance analysis and concluded that `publish` commands are using up ~5% of user-mode CPU cycles. If we can reduce the volume or impact of these calls, we can get more headroom.
## Workhorse keywatcher
The Redis publish/subscribe mechanism is used in workhorse to facilitate long-polling: a user request is left hanging, and the response is only sent once an event happens. This mechanism is only used by two endpoints, `jobs/request` and `builds/register.json`.
The keywatcher in workhorse subscribes to the redis key `workhorse:notifications`. The rails application publishes messages to this channel, delivering them to all workhorses.
## Analysis
The message rate is not very high, peaking at around 250 msg/s:

```
avg(rate(gitlab_workhorse_keywatcher_total_messages{env="gprd"}[10m]))
```
However, the rate of total messages delivered peaks at 80k msg/s:

```
sum(rate(gitlab_workhorse_keywatcher_total_messages{env="gprd"}[10m]))
```
It turns out the reason is that we enable this feature on all workhorses, not only the API ones. Especially with git and websocket workhorses having moved to k8s, we have many more of them and they autoscale, peaking at >200 workhorse processes in total:

```
count by (type) (gitlab_workhorse_keywatcher_total_messages{env="gprd"})
```
If we disable this logic for everything except API, we can bring the rate of delivered messages down from 80k msg/s to 10k msg/s (250 msg/s × 40 api workhorses). That's almost an order of magnitude reduction.
Side-note: This will also reclaim some CPU cycles on workhorse. According to the cloud profiler, this saves us 5% on workhorse-git (source), 3.5% on workhorse-web (source), 8.7% on workhorse-websockets (source).
## Implementation
The workhorse `config.toml` file has a `[redis]` section with settings for sentinel and password. When that section is omitted, workhorse does not subscribe to the `workhorse:notifications` key.
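For illustration, a sketch of what that section might look like; the key names and values here are illustrative placeholders, not verified against the workhorse config schema:

```toml
# Presence of this section enables the keywatcher subscription.
# Omitting the whole section on git/web/websockets fleets would disable it.
[redis]
URL = "tcp://redis.example.internal:6379"  # hypothetical address
Password = "redacted"                      # hypothetical placeholder
```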
On VMs, this is managed via omnibus. It is not yet possible to disable the behaviour there; support for this needs to be added.
On k8s, this is managed via the CNG helm chart. The option to disable that setting also needs to be added there.
## Context
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12625
- scalability#655 (comment 524929313)
- production#3887 (closed)
- production#3883 (closed)
## Thanks
Huge thanks to @jacobvosmaer-gitlab for helping figure this out.