Sidekiq Redis experiment: split catchall by volume

Background

This experiment is extracted from #956 (closed). Several factors contribute to the high CPU usage on our Redis instance for Sidekiq, but we don't know how much weight each one carries:

1. Number of clients performing BRPOP with ...
2. ... a very long argument list (for the catchall shard) where ...
3. ... some of those arguments represent frequently-used lists (Sidekiq queues).
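
To make the shape of the problem concrete, here is a minimal sketch of the command each client issues. It assumes Sidekiq's usual `queue:<name>` key naming, a 2-second timeout, and per-fetch shuffling of the queue order; the function name `brpop_command` is hypothetical and nothing here talks to Redis.

```python
import random


def brpop_command(queue_names, timeout_s=2, shuffle=True, rng=random):
    """Build the argument list for one BRPOP call over the given queues.

    Assumptions (not taken from the issue): keys are named "queue:<name>",
    the timeout is 2 seconds, and the key order is shuffled on each fetch.
    """
    keys = [f"queue:{name}" for name in queue_names]
    if shuffle:
        # sample() with k=len(keys) returns a shuffled copy of the list
        keys = rng.sample(keys, k=len(keys))
    return ["BRPOP", *keys, str(timeout_s)]


# A shard with a handful of queues produces a short command ...
short_cmd = brpop_command(["web_hook", "project_import_schedule"], shuffle=False)

# ... while a shard listening on hundreds of queues (placeholder names here)
# sends one key per queue on every single fetch.
long_cmd = brpop_command([f"placeholder_queue_{i}" for i in range(300)])

print(len(short_cmd), len(long_cmd))
```

Every listening client sends the full key list on each fetch, so the number of clients (factor 1) and the list length (factor 2) multiply together.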

Experiment

https://log.gprd.gitlab.net/goto/1493471c48275132c8cfe0ea11983607 shows that the top 6 queues on catchall account for over 50% of jobs by volume. Those queues are:

update_namespace_statistics:namespaces_schedule_aggregation
web_hook
project_import_schedule
repository_update_mirror
pipeline_background:ci_build_trace_chunk_flush
projects_git_garbage_collect

If we moved those to a hypothetical 'catchsome' shard, we would gain another shard with a short BRPOP list, and reduce the job volume flowing through the very long BRPOP list on the remaining catchall shard. If factor 3 above carries a lot of weight, this might help.
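
The selection of queues for such a shard could be sketched as follows. This is an illustration only: the function name `split_by_volume`, the queue names, and the job counts are made up, and the real figures come from the log query linked above.

```python
def split_by_volume(job_counts, target_share=0.5):
    """Split queues into (busy, rest) by job volume.

    `busy` is the shortest prefix of the queues, ranked by descending job
    count, whose combined share of all jobs exceeds `target_share`.
    """
    total = sum(job_counts.values())
    ranked = sorted(job_counts, key=job_counts.get, reverse=True)
    busy, covered = [], 0
    for name in ranked:
        # Stop once the queues picked so far cover more than the target share
        if total == 0 or covered / total > target_share:
            break
        busy.append(name)
        covered += job_counts[name]
    rest = ranked[len(busy):]
    return busy, rest


# Hypothetical counts, for illustration only (not measured data)
example_counts = {
    "queue_a": 400,
    "queue_b": 300,
    "queue_c": 150,
    "queue_d": 100,
    "queue_e": 50,
}
catchsome, catchall = split_by_volume(example_counts)
print(catchsome, catchall)
```

With these made-up numbers the two busiest queues cover 70% of the jobs, so they would go to 'catchsome' and the other three would stay on catchall.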

Results

See #959 (comment 542551823)

Conclusions

This has a larger effect than simply splitting the queues into two sets, unless the other experimental changes (multiple workers chosen randomly, for each shard) have had some effect. But I think the results are explicable by the dequeue book-keeping time in Redis being much reduced (only having 6 queues) for 50% of the catchall workload.

Edited Apr 01, 2021 by Craig Miskell