What drives the CPU cost of BRPOP timeouts on Redis for Sidekiq

Goal

BRPOP timeouts cost a non-trivial amount of CPU time, but we do not yet know why. Are there any ways we can reduce the overhead per timeout event?

This issue aims to determine whether the overhead per timeout event varies, and to measure how frequently timeouts occur.

If the overhead per timeout is not constant, then we may be able to control some of the factors influencing that overhead.

Background motivation

The motivation here is that we currently have a single Redis instance supporting all of the Sidekiq shards, and that Redis instance's main thread is dangerously close to CPU saturation during the daily peak workload. Profiling from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12898 suggests that BRPOP timeouts represent a modest amount of CPU waste. Reducing that waste may give back a little headroom, so that we can focus on longer-term improvements to scaling redis-sidekiq.

Some possible outcomes

This issue aims to refine our current answer to "Where is the CPU time spent?". Here are some initial ideas for how that answer could inform our short-term and medium-term efforts to avoid Redis CPU saturation.

  • Does a BRPOP timeout get more expensive when many queues are being watched?
    • If so, the queue consolidation effort would likely become higher priority.
  • Does the number of clients concurrently blocked on BRPOP add much overhead to the periodic HZ-driven clientsCron task that cleans up timed-out blocking clients?
    • If so, the number of blocking clients may matter more than the rate of timeouts. That finding would change our priorities for mitigations (making overall BRPOP concurrency reduction more important than timeout tuning).
    • Since timeout events are triggered by walking the list of blocking clients, reducing the length of that list may reduce the cost of evaluating which clients have reached their timeout. But we do not yet know whether this portion of the processing is cheap or expensive; the back-of-envelope sketch after this list shows how that cost would scale.
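
As a rough illustration of why list length might matter, here is a back-of-envelope model. It is not Redis source: the per-check cost is an invented placeholder, and it assumes the cron examines each blocked client roughly once per second.

    # Back-of-envelope: per-second cost of scanning blocked clients for timeouts.
    # Assumptions (placeholders, not measurements): each blocked client is
    # examined ~once per second, and each timeout check costs ~200 ns.

    def scan_cost_fraction(blocked_clients, ns_per_check=200, checks_per_sec=1):
        """Fraction of one CPU core spent per second checking for timeouts."""
        return blocked_clients * checks_per_sec * ns_per_check / 1e9

    for n in (100, 1_000, 10_000, 50_000):
        print(f"{n:>6} blocked clients -> ~{scan_cost_fraction(n):.4%} of one core")

The only point is that the scan is linear in the list length; whether it matters in practice hinges on the real per-check cost, which the instrumentation below should reveal.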

Implementation notes

Capturing the latency distribution of either all BRPOP commands or just BRPOP timeouts could add significant overhead, depending on call frequency. So before capturing a latency distribution, first run a cheaper count of the events whose latency we want to measure, as sketched below. If the counts are acceptably low, then proceed with the latency histogram.
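
As a sketch of that cheaper counting step (done here with BCC uprobes; perf probe plus perf stat would work just as well), assuming an unstripped redis-server binary at a placeholder path, with these symbols present and not inlined away:

    # Count calls to the two functions in the "Related question" ratio below.
    # Assumptions: unstripped redis-server at BINARY (placeholder path), and
    # blockingPopGenericCommand still visible in the symbol table. Run as root.
    import ctypes as ct
    import time

    from bcc import BPF

    BINARY = "/usr/bin/redis-server"  # placeholder: use the real binary path

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_ARRAY(counts, u64, 2);

    int on_cron_timeout(struct pt_regs *ctx) {
        int slot = 0;
        counts.increment(slot);
        return 0;
    }

    int on_brpop(struct pt_regs *ctx) {
        int slot = 1;
        counts.increment(slot);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_uprobe(name=BINARY, sym="clientsCronHandleTimeout", fn_name="on_cron_timeout")
    b.attach_uprobe(name=BINARY, sym="blockingPopGenericCommand", fn_name="on_brpop")

    time.sleep(60)  # sampling window

    timeouts = b["counts"][ct.c_int(0)].value
    brpops = b["counts"][ct.c_int(1)].value
    print(f"clientsCronHandleTimeout calls:  {timeouts}")
    print(f"blockingPopGenericCommand calls: {brpops}")
    if brpops:
        print(f"timeout ratio: {timeouts / brpops:.3f}")

If those counts stay modest over the sampling window, the histogram step becomes safe to run.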

Code points to instrument (a latency-histogram sketch follows this list):

  • unblockClient: both code paths (it is called from the two functions below)
  • clientsCronHandleTimeout: spends most of its CPU time calling unblockClient
  • handleClientsBlockedOnKeys: spends most of its CPU time calling unblockClient
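
If the counts are low enough, per-call latency for any of these functions could be captured with the classic uprobe/uretprobe "funclatency" pattern. Here is a minimal sketch for unblockClient, under the same binary-path and symbol assumptions as the counting script:

    # Log2 histogram of wall time per unblockClient call (funclatency pattern).
    # Same assumptions as above: unstripped binary, placeholder path, run as root.
    import time

    from bcc import BPF

    BINARY = "/usr/bin/redis-server"  # placeholder: use the real binary path

    prog = """
    #include <uapi/linux/ptrace.h>

    BPF_HASH(start, u32, u64);   // thread id -> entry timestamp
    BPF_HISTOGRAM(dist);         // log2 buckets of call duration (ns)

    int on_entry(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 ts = bpf_ktime_get_ns();
        start.update(&tid, &ts);
        return 0;
    }

    int on_return(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&tid);
        if (tsp == 0)
            return 0;
        dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
        start.delete(&tid);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_uprobe(name=BINARY, sym="unblockClient", fn_name="on_entry")
    b.attach_uretprobe(name=BINARY, sym="unblockClient", fn_name="on_return")

    time.sleep(60)  # sampling window
    b["dist"].print_log2_hist("ns")

Note that the uretprobe itself adds per-call overhead, which is exactly why the count-first step above matters.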

Related question:

What proportion of BRPOP commands reach their timeout?

We can answer this with low-overhead perf stat counters alone, taking the ratio of the BRPOP timeout count to the overall BRPOP count:

count(clientsCronHandleTimeout) / count(blockingPopGenericCommand)

For reference, here is a flamegraph from redis-sidekiq-03 (primary) showing the relevant code paths mentioned above.

redis-server.20210325_143518_UTC.flamegraph.svg
