Sampled BRPOP queues for sidekiq-catchall

This is an idea I had for &194 (closed). It can be considered an alternative to &447 (closed) (though we may very well want to do both).

Summary

We know that listening on many queues is more expensive than listening on few queues.

Digging deeper into what this actually means: the CPU cost of the BRPOP command (used to dequeue) scales with the number of arguments to BRPOP multiplied by the number of messages (sidekiq jobs processed across all queues).
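
For illustration, here is a minimal sketch of such a dequeue through redis-rb (the queue names below are made up): BRPOP takes an arbitrary number of list keys plus a timeout, and redis has to walk the entire key list on every call.

```ruby
require "redis"

redis = Redis.new

# Hypothetical queue names for illustration; production uses the
# resque:gitlab:queue:* keys shown further down.
queues = (1..234).map { |i| "resque:gitlab:queue:q#{i}" }

# redis scans all 234 keys per invocation, so the per-call CPU cost
# grows with the length of this argument list.
queue, job = redis.brpop(*queues, timeout: 2)
```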

The proposal is to use a sample (say, 10%) of all 234 catchall queues in every BRPOP call. Since we have many workers, we are still able to serve all queues, but at a fraction of the cost (in terms of CPU cycles).

Background

We're dealing with scaling challenges on our redis-sidekiq fleet: scalability#590. CPU cycles are a scarce resource in redis. Thus we've been looking at opportunities to optimize.

One such optimization is to make BRPOP calls cheaper by reducing the number of queues per BRPOP (&194 (closed)). In fact, sidekiq's own advice is to use one queue per worker.

In order to verify that this optimization is worth making, we ran some synthetic experiments that showed 15-20% of a CPU being freed up by switching to one-queue-per-shard: scalability#956 (comment 542558210).

I wanted to reproduce these results and gather some more experimental data.

Analysis

In order to do so, I built a small microbenchmark harness: redis-brpop-microbench.

This tool starts a redis-server, runs a synthetic BRPOP workload against it, and captures a CPU profile as well as statistics. This allows us to quantify the CPU impact under various configurations and attribute it to expensive code paths.
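
The essential shape of that workload, sketched in Ruby purely for illustration (this is not the tool's actual implementation; the producer/consumer counts mirror the experiments below):

```ruby
require "redis"

KEYS = (1..234).map { |i| "queue:#{i}" }

# 100 producers each push 1000 messages onto random queues.
producers = Array.new(100) do
  Thread.new do
    redis = Redis.new
    1000.times { redis.lpush(KEYS.sample, "job") }
  end
end

# 1000 consumers block-pop across all 234 queues until the workload drains.
consumers = Array.new(1000) do
  Thread.new do
    redis = Redis.new
    loop { break if redis.brpop(*KEYS, timeout: 1).nil? }
  end
end

(producers + consumers).each(&:join)
```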

Hypothesis

BRPOP performance scales with the number of arguments (key names, aka queues).

Preparation

We can design an experiment that tests different BRPOP argument lengths.

To get an idea of the distribution, we can capture redis traffic from production and look at the BRPOP argument length. I reproduced @msmiley's analysis following our redis packet capture runbook on a 30-second pcap. This is off-peak (06:45 UTC), but hopefully still close enough.

Distribution of BRPOP invocations per distinct argument count (the *N below is the RESP array header, i.e. the total number of elements in the command: the brpop command name plus all of its arguments):

$ ag --no-filename -B2 'brpop' | grep '^\*' | sort | uniq -c | sort -rn

11092 *234
 8197 *20
 5202 *18
 4739 *103
  771 *3
  541 *9
  440 *37
   48 *1

Number of clients listening per argument length:

$ ag -l '^\*234\r\n\$5\r\nbrpop\r\n' | wc -l

...

1039 *234
 853 *103
  70 *37
 721 *20
 435 *18

Number of messages pushed per queue:

$ ag --no-filename lpush -A2 | grep '^resque' | sort | uniq -c | sort -nr | head

2840 resque:gitlab:queue:pipeline_cache:expire_pipeline_cache
2266 resque:gitlab:queue:pipeline_processing:pipeline_process
2094 resque:gitlab:queue:pipeline_hooks:build_hooks
2033 resque:gitlab:queue:pipeline_cache:expire_job_cache
1625 resque:gitlab:queue:repository_update_mirror
1420 resque:gitlab:queue:update_namespace_statistics:namespaces_schedule_aggregation
1203 resque:gitlab:queue:projects_git_garbage_collect
 813 resque:gitlab:queue:pipeline_background:ci_build_trace_chunk_flush
 740 resque:gitlab:queue:web_hook
 608 resque:gitlab:queue:jira_connect:jira_connect_sync_builds

This gives us a rough idea of which parameters to test.

Experiment

We will run redis-brpop-microbench with a fixed workload (100 producers pushing 1000 messages each) and vary the number of keys in each BRPOP call. We perform one warm-up run and then take 3 measurements per configuration.

Baseline: 1k consumers, 234 brpop keys

$ redis-brpop-microbench -messages-per-producer=1000 -consumer-brpop-num-keys=234 -keys=234 -consumers=1000 2>&1 | jq

{
  "cpu_time_total": 16.333894,    # 15.921169, 16.143732
  "cpu_time_sys": 2.077744,       # 1.86083, 1.864541
  "cpu_time_user": 14.25615,      # 14.060339, 14.279191

  "brpop_usec": "6445827",        # 6330948, 6348956
  "brpop_usec_per_call": "63.82", # 62.68, 62.86
}

Raw data:

{
  "level": "info",
  "ts": "2021-04-06T09:47:34.308Z",
  "msg": "stats",
  "cpu_time_total": 16.333894,
  "cpu_time_sys": 2.077744,
  "cpu_time_user": 14.25615,
  "brpop_calls": "101000",
  "brpop_usec": "6445827",
  "brpop_usec_per_call": "63.82",
  "lpush_calls": "100000",
  "lpush_usec": "89657",
  "lpush_usec_per_call": "0.90"
}
{
  "level": "info",
  "ts": "2021-04-06T09:47:52.050Z",
  "msg": "stats",
  "cpu_time_total": 15.921169,
  "cpu_time_sys": 1.86083,
  "cpu_time_user": 14.060339,
  "brpop_calls": "101000",
  "brpop_usec": "6330948",
  "brpop_usec_per_call": "62.68",
  "lpush_calls": "100000",
  "lpush_usec": "87803",
  "lpush_usec_per_call": "0.88"
}
{
  "level": "info",
  "ts": "2021-04-06T09:48:17.298Z",
  "msg": "stats",
  "cpu_time_total": 16.143732,
  "cpu_time_sys": 1.864541,
  "cpu_time_user": 14.279191,
  "brpop_calls": "101000",
  "brpop_usec": "6348956",
  "brpop_usec_per_call": "62.86",
  "lpush_calls": "100000",
  "lpush_usec": "88139",
  "lpush_usec_per_call": "0.88"
}

Split down the middle: 1k consumers, 117 brpop keys

$ redis-brpop-microbench -messages-per-producer=1000 -consumer-brpop-num-keys=117 -keys=234 -consumers=1000 2>&1 | jq

{
  "cpu_time_total": 10.331576,    # 10.052743, 10.200656
  "cpu_time_sys": 1.7251,         # 1.649889, 1.661362
  "cpu_time_user": 8.606476,      # 8.402854, 8.539294

  "brpop_usec": "3565397",        # 3468252, 3511699
  "brpop_usec_per_call": "35.30", # 35.30, 34.77
}

Raw data:

{
  "level": "info",
  "ts": "2021-04-06T09:52:53.179Z",
  "msg": "stats",
  "cpu_time_total": 10.331576,
  "cpu_time_sys": 1.7251,
  "cpu_time_user": 8.606476,
  "brpop_calls": "101000",
  "brpop_usec": "3565397",
  "brpop_usec_per_call": "35.30",
  "lpush_calls": "100000",
  "lpush_usec": "82833",
  "lpush_usec_per_call": "0.83"
}
{
  "level": "info",
  "ts": "2021-04-06T09:53:08.453Z",
  "msg": "stats",
  "cpu_time_total": 10.052743,
  "cpu_time_sys": 1.649889,
  "cpu_time_user": 8.402854,
  "brpop_calls": "101000",
  "brpop_usec": "3468252",
  "brpop_usec_per_call": "34.34",
  "lpush_calls": "100000",
  "lpush_usec": "86104",
  "lpush_usec_per_call": "0.86"
}
{
  "level": "info",
  "ts": "2021-04-06T09:53:28.459Z",
  "msg": "stats",
  "cpu_time_total": 10.200656,
  "cpu_time_sys": 1.661362,
  "cpu_time_user": 8.539294,
  "brpop_calls": "101000",
  "brpop_usec": "3511699",
  "brpop_usec_per_call": "34.77",
  "lpush_calls": "100000",
  "lpush_usec": "84343",
  "lpush_usec_per_call": "0.84"
}

20%: 1k consumers, 47 brpop keys

$ redis-brpop-microbench -messages-per-producer=1000 -consumer-brpop-num-keys=47 -keys=234 -consumers=1000 2>&1 | jq

{
  "cpu_time_total": 5.933806,     # 5.72034, 5.692153
  "cpu_time_sys": 1.588612,       # 1.588612, 1.528481
  "cpu_time_user": 4.196494,      # 4.131728, 4.163672

  "brpop_usec": "1548100",        # 1532738, 1532738
  "brpop_usec_per_call": "15.33", # 15.18, 15.02
}

Raw data:

{
  "level": "info",
  "ts": "2021-04-06T09:57:47.339Z",
  "msg": "stats",
  "cpu_time_total": 5.933806,
  "cpu_time_sys": 1.737312,
  "cpu_time_user": 4.196494,
  "brpop_calls": "101000",
  "brpop_usec": "1548100",
  "brpop_usec_per_call": "15.33",
  "lpush_calls": "100000",
  "lpush_usec": "77582",
  "lpush_usec_per_call": "0.78"
}
{
  "level": "info",
  "ts": "2021-04-06T09:57:54.180Z",
  "msg": "stats",
  "cpu_time_total": 5.72034,
  "cpu_time_sys": 1.588612,
  "cpu_time_user": 4.131728,
  "brpop_calls": "101000",
  "brpop_usec": "1532738",
  "brpop_usec_per_call": "15.18",
  "lpush_calls": "100000",
  "lpush_usec": "76224",
  "lpush_usec_per_call": "0.76"
}
{
  "level": "info",
  "ts": "2021-04-06T09:58:01.530Z",
  "msg": "stats",
  "cpu_time_total": 5.692153,
  "cpu_time_sys": 1.528481,
  "cpu_time_user": 4.163672,
  "brpop_calls": "101000",
  "brpop_usec": "1516915",
  "brpop_usec_per_call": "15.02",
  "lpush_calls": "100000",
  "lpush_usec": "75620",
  "lpush_usec_per_call": "0.76"
}

10%: 1k consumers, 24 brpop keys

$ redis-brpop-microbench -messages-per-producer=1000 -consumer-brpop-num-keys=24 -keys=234 -consumers=1000 2>&1 | jq

{
  "cpu_time_total": 4.732149,     # 4.82061, 4.890596
  "cpu_time_sys": 1.639162,       # 1.825164, 1.799133
  "cpu_time_user": 3.092987,      # 2.995446, 3.091463

  "brpop_usec": "865158",         # 880852, 882376
  "brpop_usec_per_call": "8.57",  # 8.72, 8.74
}

Raw data:

{
  "level": "info",
  "ts": "2021-04-06T10:03:13.271Z",
  "msg": "stats",
  "cpu_time_total": 4.732149,
  "cpu_time_sys": 1.639162,
  "cpu_time_user": 3.092987,
  "brpop_calls": "101000",
  "brpop_usec": "865158",
  "brpop_usec_per_call": "8.57",
  "lpush_calls": "100000",
  "lpush_usec": "77826",
  "lpush_usec_per_call": "0.78"
}
{
  "level": "info",
  "ts": "2021-04-06T10:03:23.270Z",
  "msg": "stats",
  "cpu_time_total": 4.82061,
  "cpu_time_sys": 1.825164,
  "cpu_time_user": 2.995446,
  "brpop_calls": "101000",
  "brpop_usec": "880852",
  "brpop_usec_per_call": "8.72",
  "lpush_calls": "100000",
  "lpush_usec": "81820",
  "lpush_usec_per_call": "0.82"
}
{
  "level": "info",
  "ts": "2021-04-06T10:03:31.667Z",
  "msg": "stats",
  "cpu_time_total": 4.890596,
  "cpu_time_sys": 1.799133,
  "cpu_time_user": 3.091463,
  "brpop_calls": "100995",
  "brpop_usec": "882376",
  "brpop_usec_per_call": "8.74",
  "lpush_calls": "100000",
  "lpush_usec": "80248",
  "lpush_usec_per_call": "0.80"
}

Fewer consumers: 100 consumers, 234 brpop keys

$ redis-brpop-microbench -messages-per-producer=1000 -consumer-brpop-num-keys=234 -keys=234 -consumers=100 2>&1 | jq

{
  "cpu_time_total": 5.008346,     # 5.143734, 5.11053
  "cpu_time_sys": 1.838826,       # 1.758868, 1.531365
  "cpu_time_user": 3.16952,       # 3.384866, 3.579165

  "brpop_usec": "1128040",        # 1183017, 1175302
  "brpop_usec_per_call": "11.26", # 11.81, 11.73
}

Raw data:

{
  "level": "info",
  "ts": "2021-04-06T10:09:05.451Z",
  "msg": "stats",
  "cpu_time_total": 5.008346,
  "cpu_time_sys": 1.838826,
  "cpu_time_user": 3.16952,
  "brpop_calls": "100142",
  "brpop_usec": "1128040",
  "brpop_usec_per_call": "11.26",
  "lpush_calls": "100000",
  "lpush_usec": "132750",
  "lpush_usec_per_call": "1.33"
}
{
  "level": "info",
  "ts": "2021-04-06T10:09:14.322Z",
  "msg": "stats",
  "cpu_time_total": 5.143734,
  "cpu_time_sys": 1.758868,
  "cpu_time_user": 3.384866,
  "brpop_calls": "100154",
  "brpop_usec": "1183017",
  "brpop_usec_per_call": "11.81",
  "lpush_calls": "100000",
  "lpush_usec": "138213",
  "lpush_usec_per_call": "1.38"
}
{
  "level": "info",
  "ts": "2021-04-06T10:09:23.166Z",
  "msg": "stats",
  "cpu_time_total": 5.11053,
  "cpu_time_sys": 1.531365,
  "cpu_time_user": 3.579165,
  "brpop_calls": "100163",
  "brpop_usec": "1175302",
  "brpop_usec_per_call": "11.73",
  "lpush_calls": "100000",
  "lpush_usec": "136093",
  "lpush_usec_per_call": "1.36"
}

Results

We can present the results as two charts.

CPU time

First we look at CPU time. This is the primary resource we are trying to save on.

We can see that system time stays roughly constant, while user time varies a lot: it scales almost linearly with the number of arguments to BRPOP.

[Chart: total/user/system CPU time per BRPOP key count (Screenshot_2021-04-06_at_12.31.10)]
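
The underlying numbers (three runs per configuration, rounded from the raw data above):

| BRPOP keys | cpu_time_user (s) | cpu_time_sys (s) |
|-----------:|-------------------|------------------|
| 234 | 14.26 / 14.06 / 14.28 | 2.08 / 1.86 / 1.86 |
| 117 | 8.61 / 8.40 / 8.54 | 1.73 / 1.65 / 1.66 |
| 47 | 4.20 / 4.13 / 4.16 | 1.74 / 1.59 / 1.53 |
| 24 | 3.09 / 3.00 / 3.09 | 1.64 / 1.83 / 1.80 |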

brpop usec per call

Now we look at how much of that CPU cost is attributable to BRPOP execution.

We see a very similar pattern. The cost per BRPOP call scales almost linearly with the length of the argument list.

[Chart: brpop_usec_per_call per BRPOP key count (Screenshot_2021-04-06_at_12.31.23)]
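
Again the underlying numbers (three runs per configuration, from the raw data above):

| BRPOP keys | brpop_usec_per_call |
|-----------:|---------------------|
| 234 | 63.82 / 62.68 / 62.86 |
| 117 | 35.30 / 34.34 / 34.77 |
| 47 | 15.33 / 15.18 / 15.02 |
| 24 | 8.57 / 8.72 / 8.74 |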

Conclusion

This is additional evidence that BRPOP performance scales linearly with the length of the argument list. Since that cost is paid for every delivered message, the total cost is in fact O(n_queues * n_messages).
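
As a quick sanity check on linearity, dividing the measured per-call averages (first run of each configuration) by the key count should yield a roughly constant per-key cost:

```ruby
# brpop_usec_per_call from the first run of each configuration above.
measurements = { 234 => 63.82, 117 => 35.30, 47 => 15.33, 24 => 8.57 }

measurements.each do |keys, usec|
  puts format("%3d keys: %5.2f usec/call, %.3f usec/key", keys, usec, usec / keys)
end
# => roughly 0.27-0.36 usec per key across configurations: near-linear in
#    the key count, plus a small fixed per-call overhead.
```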

Side-note: number of clients

Lowering the number of consumers also resulted in a massive reduction, but this is misleading. When experimenting further with different client counts, I observed a bimodal distribution, indicating a phase transition.

My hypothesis (based on flamegraphs) is that we pay a significant additional overhead when the queue becomes empty. When that happens, redis de-allocates the key and all associated metadata. Then, when the next message comes in, it needs to allocate those objects again to create the key.
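
This matches documented redis behaviour: a list key is deleted as soon as its last element is popped, and the next push must re-create it. A minimal illustration via redis-rb (demo_queue is a throwaway key):

```ruby
require "redis"

redis = Redis.new
redis.lpush("demo_queue", "job1")
redis.rpop("demo_queue")          # drain the list
redis.exists?("demo_queue")       # => false: the key and its metadata are gone
redis.lpush("demo_queue", "job2") # this push must allocate the key again
```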

When client counts are lower such that a few messages pile up before being drained, we amortise that allocation cost. Ironically, that suggests that coordinated pausing between BRPOPs would help quite a bit for this workload.

I've noted this upstream on this redis issue.

Proposal

By studying this problem at the redis level (as opposed to the sidekiq level), I had an insight.

Our goal is to reduce the argument list to BRPOP per call. What if instead of performing a BRPOP on all 234 keys, we pick a subset (say, 10%) of those keys at random?

Since we have many consumers, they will still serve all queues by random choice.

Just like the upstream Sidekiq::Fetch, our very own SemiReliableFetch already performs a weighted shuffle on the list of queues.

We can then pick a subset from that shuffled list, as sketched below.
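
A minimal sketch of the idea (hypothetical code, not the actual SemiReliableFetch; SAMPLE_RATIO and queues_for_brpop are made-up names, and a redis client plus the already-shuffled queues list are assumed to be in scope):

```ruby
SAMPLE_RATIO = 0.1 # hypothetical knob: fraction of queues per BRPOP call

def queues_for_brpop(shuffled_queues)
  sample_size = [(shuffled_queues.size * SAMPLE_RATIO).ceil, 1].max
  # Taking the head of the already weighted-shuffled list amounts to a
  # weighted sample without replacement, so queue weights keep their effect.
  shuffled_queues.first(sample_size)
end

# Each BRPOP now carries ~24 keys instead of 234. With ~1000 consumers
# drawing independent samples, the probability that a given queue has no
# listener at all is about (1 - 0.1)^1000 ≈ 2e-46.
queue, job = redis.brpop(*queues_for_brpop(queues), timeout: 2)
```

With a 10% sample this lands on the 24-key configuration benchmarked above, which cut brpop_usec_per_call from ~63 to ~8.6.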

cc @cmiskell @smcgivern @reprazent @jacobvosmaer-gitlab @msmiley @qmnguyen0711
