
Reduce excess sidekiq-catchall clients to reduce unproductive BRPOP work

We are close to the limit with redis-sidekiq. Efforts in this space include &423 (closed) and (the failed) https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12343.

Previous analysis of redis-sidekiq profiles suggests a lot of time is spent on BRPOP timing out and re-executing. This suggests we may have an excess of sidekiq clients that are trying to consume from a queue and timing out frequently.

To gain more evidence around this hypothesis, we can look at the client list in Redis and see how many clients are blocked in a BRPOP call.
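A quick way to get the headline numbers first (a sketch using only standard redis-cli output, run the same way as the per-IP commands below):

$ sudo gitlab-redis-cli client list | grep -c cmd=brpop     # connections whose current command is BRPOP
$ sudo gitlab-redis-cli info clients | grep blocked_clients # Redis' own count of clients blocked in BRPOP/BLPOP/etc.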

Analysis

brpop clients per IP: we can see 7 IPs in the [64, 128) bucket (i.e. 7 IPs each have between 64 and 128 blocking clients).

➜  ~ ssh redis-sidekiq-03-db-gprd.c.gitlab-production.internal sudo gitlab-redis-cli client list | grep cmd=brpop | awk '{ print $2 }' | sed -E 's/addr=(.+):.+/\1/g' | sort | uniq -c | awk '{ print $1 }' | log2hist

             [1]        6  ∎
          [2, 4)       21  ∎∎∎∎∎
          [4, 8)      214  ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
         [8, 16)      147  ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
        [16, 32)        0
        [32, 64)        0
       [64, 128)        7  ∎
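For context, log2hist is a small local helper that buckets its numeric input into power-of-two ranges. A minimal awk stand-in (an assumption about its behavior, not the actual tool), reading one count per line on stdin:

$ awk '{ b = int(log($1)/log(2) + 1e-9); h[b]++; if (b > max) max = b } END { for (b = 0; b <= max; b++) printf "[%d, %d)\t%d\n", 2^b, 2^(b+1), h[b]+0 }'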

How many blocking clients do these IPs have?

$ sudo gitlab-redis-cli client list | grep cmd=brpop | awk '{ print $2 }' | sed -E 's/addr=(.+):.+/\1/g' | sort | uniq -c | sort -nr | head

    119 10.220.7.228
    118 10.220.7.235
    118 10.220.7.229
    117 10.220.7.236
    117 10.220.7.226
    116 10.220.7.227
    114 10.220.7.211
     15 10.222.96.200
     15 10.222.96.189
     15 10.222.81.46

The top seven IPs each have over 100 blocking clients.

Who do these IPs belong to?

$ sudo gitlab-redis-cli client list | grep cmd=brpop | awk '{ print $2 }' | sed -E 's/addr=(.+):.+/\1/g' | sort | uniq -c | sort -nr | awk '$1 > 100 { print $2 }' | xargs -n1 dig +short -x | sort

sidekiq-catchall-01-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-02-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-03-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-04-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-05-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-06-sv-gprd.c.gitlab-production.internal.
sidekiq-catchall-07-sv-gprd.c.gitlab-production.internal.

Looks like they are all from the sidekiq-catchall fleet.

Hypothesis

I suspect we may be able to reclaim some cycles on redis-sidekiq by reducing the size or sidekiq concurrency of the sidekiq-catchall fleet.

cc @cmiskell @smcgivern @msmiley


Results summary

We learned some important new facts from this experiment. The wasted CPU time is real, but we are still missing some relevant aspect of what drives it. What we learned here helps us better interpret observations from production and synthetic tests, and it leads directly to a set of follow-up ideas.

Key findings are summarized below.

Amount of CPU time wasted in BRPOP timeout varies significantly

The BRPOP timeouts waste a variable amount of CPU time: Between roughly 6% and 12% of the overall CPU capacity is spent in this code path.

This may be at least partially due to variation in job creation rate. When more jobs are being created, the BRPOP commands watching those job queues are less likely to reach their timeout.
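To make the mechanism concrete, here is a minimal sketch against a scratch Redis instance (demo:queue is a throwaway key, not a real Sidekiq queue):

$ redis-cli brpop demo:queue 2      # blocks; if nothing is pushed within 2s it times out, and the client (sidekiq) re-issues it
$ redis-cli lpush demo:queue job1   # from another shell: a newly created job unblocks the waiting BRPOP before its timeout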

Insights
  • The amount of CPU time wasted on BRPOP timeouts might be at a minimum during the peak of Redis' daily workload cycle, when job creation rate is at its peak.
  • Some sidekiq shards may be more prone to BRPOP timeouts than others.
Impact
  • For this and other BRPOP timeout experiments, it is especially important to note whether a particular set of production measurements was captured during peak or non-peak workload.
  • Remapping which queues each sidekiq shard listens to may affect how likely that shard's BRPOP commands are to timeout.
    • Since the timeout overhead is on the redis-server, it is effectively shared overhead among all connected clients.
    • It is not yet clear what aspects of the workload affect how expensive BRPOP timeouts are. The count of blocking clients and the count of queues being watched by BRPOP may both be factors. We have some ideas for how to measure this as follow-up work.
Illustration

Variation in the BRPOP timeout overhead, captured during a 10-minute timespan. This example is from peak workload when the redis-server process on redis-sidekiq-03 (primary) was CPU saturated. The same kind of variation occurs during non-peak times of day when CPU is not saturated.

How to read this graph: The following graph shows a timeline where more red indicates more BRPOP timeouts. Each column represents 1 second of profiling data, and each dot represents a count of how many of the sampled events were in the BRPOP timeout code path.

With the aggregation set to 25 buckets per second, each bucket aggregates 40 ms of wall clock time. Each bucket represents the number of times during that interval that the sampling profiler found a stack that was handling a BRPOP timeout for one of the blocking client connections. The profiler was configured to collect up to 500 samples per second, so each bucket could have at most 500 / 25 = 20 events.

In the reddest buckets in this graph (7 sampled events in a 40 ms interval), Redis spent approximately 35% (7 / 20) of its CPU time on the BRPOP timeout code path.
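Spelling out that bucket arithmetic (plain shell arithmetic, no new measurements):

$ echo $(( 500 / 25 ))        # 20: the most sampled events a 40 ms bucket can hold
$ echo $(( 7 * 100 / 20 ))    # 35 (%): share of samples in the BRPOP timeout path for the reddest (7-event) buckets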

Screenshot_from_2021-03-25_08-26-15

No improvement from reducing number of catchall sidekiq connections

Reducing the catchall sidekiq shard's number of redis client connections did not measurably reduce the CPU time spent by redis-server on BRPOP timeouts.

Initial testing looked promising, but once we discovered that the background variation was larger than the improvement we had observed, we realized the apparent improvement could just be uncontrolled jitter.

Follow-up observations at much finer granularity showed no significant change in the BRPOP timeout CPU overhead when we reduced the number of sidekiq catchall nodes.

This contradicts our initial hypothesis.

We need a new hypothesis to explain what is driving the CPU cost of BRPOP timeouts. We have a couple of ideas, outlined below in the Next steps section.

Insights

Either of the following could potentially explain why stopping several catchall nodes did not reduce the BRPOP timeout overhead. Follow-up analysis can differentiate between these cases.

  • The BRPOP timeouts may be coming from a sidekiq shard other than catchall.
    • The catchall shard's BRPOP commands watch a large number of queues, and a job being created on any of those queues can satisfy the command. So even though many of those queues individually have a low job creation rate, collectively their summed job creation rate may be high enough to avoid timeouts (see the sketch after this list). If so, then the BRPOP timeouts must be associated with some other shard.
  • The BRPOP timeouts may be expensive due to high duration rather than high frequency.
    • For example, if the large number of blocking clients or the large number of watched queues makes individual timeout events relatively expensive, then stopping only a small percentage of those clients would not do much to reduce their impact.
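As referenced above, a single BRPOP can watch many queues, so the job creation rates of all watched queues effectively add up when it comes to avoiding the timeout (again a sketch with throwaway key names against a scratch instance):

$ redis-cli brpop queue:cold_a queue:cold_b queue:cold_c 2   # one blocking client, many watched queues; an LPUSH to any of them unblocks it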
Impact
  • Stopping sidekiq catchall nodes does not appear to be an effective mitigation after all.
  • Adjusting the BRPOP timeout for all sidekiq shards does still seem like a viable mitigation.
  • Identifying which sidekiq shards are associated with most of the BRPOP timeouts could lead to refining our mitigation options, such as:
    • Focus timeout tuning on the sidekiq shards most prone to hitting their timeouts during Redis's peak workload.
    • Reduce connection count for those shards (like we tried here with catchall).
    • Inform our upcoming decisions about how best to reapportion the hot versus cold queues among sidekiq shards.
Illustration

The following shows the 15-minute timeline during which we serially stopped sidekiq nodes. The graph is broken into 3 chunks for easier viewing. We had hoped to see the number of red cells in the graph diminish as fewer sidekiq clients were running BRPOP. Unfortunately, that does not seem to be the case.

How to read this graph: This is just like the above example, except that we switched to aggregating the event counts into 10 buckets per second (100 ms per bucket) instead of 25 buckets per second (40 ms per bucket). This coarser granularity made it easier to see the density variations across adjacent seconds. The profiler still captured up to 500 samples per second, so with this graph's 10 buckets/second granularity, each bucket can have up to 50 samples.

In the reddest buckets (16 sampled events in a 100 ms interval), Redis spent approximately 32% (16 / 50) of its CPU time on the BRPOP timeout code path. This is comparable to the previous graph that was captured shortly before running this experiment.

Screenshot_from_2021-03-25_09-28-15

Screenshot_from_2021-03-25_09-28-39

Screenshot_from_2021-03-25_09-29-01

Supplemental details

For convenience, here is an annotated list of links to the details behind the above results summary.

Next steps

The new questions prompted by this analysis outcome are captured in the following follow-up issues:
