Identify which sidekiq shards are getting BRPOP timeouts

Goal

In https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12898 we discovered that the BRPOP timeouts may not be associated with the sidekiq catchall shard, as we had initially thought. This follow-up task aims to identify which groups of redis clients really are hitting their BRPOP timeout.

Answering this question would let us focus our tuning efforts on the relevant sidekiq shards' timeout and node count.

Background motivation

We currently have a single Redis instance supporting all of the sidekiq shards, and that instance's main thread is dangerously close to CPU saturation during the daily peak workload. Profiling from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12898 suggests that BRPOP timeouts account for a modest amount of wasted CPU. Reducing that waste may give back a little headroom while we focus on longer-term improvements to scaling redis-sidekiq.

Implementation notes

Question: Which sidekiq clients are timing out?

  • Pcap analysis seems like the easiest way to answer this question, since we know the redis protocol and can count timeout responses by client IP.
  • Translating the client IP list to sidekiq shards may be hard if the clients are k8s pods.
  • We might instead need to identify distinct groups of clients by the queues their BRPOP commands specify.
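
A minimal Python sketch of the counting step described above, under several assumptions. It takes TCP payloads already extracted from the pcap as `(src_ip, dst_ip, payload)` tuples (the extraction itself, e.g. with tshark or a pcap library, is not shown). It assumes clients speak RESP2, where a BRPOP timeout is answered with the null array `*-1\r\n`, and that each payload holds exactly one command or reply. Other commands can also return a null array, so the counts are an upper bound. The function names, IPs, and queue names are hypothetical.

```python
from collections import Counter, defaultdict

# RESP2 null array: the reply Redis sends when BRPOP hits its timeout.
NULL_ARRAY = b"*-1\r\n"


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings (for sample data only)."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        data = arg.encode()
        out += b"$%d\r\n%s\r\n" % (len(data), data)
    return out


def parse_command(payload):
    """Return the arguments of one RESP command, or None if it does not parse.

    Simplification: splits on CRLF, so arguments containing CRLF are not handled.
    """
    if not payload.startswith(b"*"):
        return None
    parts = payload.split(b"\r\n")
    try:
        argc = int(parts[0][1:])
    except ValueError:
        return None
    args = []
    i = 1
    for _ in range(argc):
        if i + 1 >= len(parts) or not parts[i].startswith(b"$"):
            return None
        args.append(parts[i + 1])
        i += 2
    return args


def summarize_brpop_timeouts(segments, redis_ip):
    """Count null-array replies per client IP and per set of BRPOP queues."""
    timeouts_by_client = Counter()
    queues_by_client = defaultdict(set)
    for src_ip, dst_ip, payload in segments:
        if src_ip == redis_ip and payload == NULL_ARRAY:
            # Reply from redis to a client: treat as a BRPOP timeout.
            timeouts_by_client[dst_ip] += 1
        elif dst_ip == redis_ip:
            args = parse_command(payload)
            if args and len(args) >= 3 and args[0].upper() == b"BRPOP":
                # BRPOP key [key ...] timeout: everything between is a queue key.
                queues_by_client[src_ip].update(
                    a.decode(errors="replace") for a in args[1:-1]
                )
    # Clients that poll the same set of queues are assumed to be one shard.
    timeouts_by_group = Counter()
    for client, count in timeouts_by_client.items():
        group = tuple(sorted(queues_by_client.get(client, ())))
        timeouts_by_group[group] += count
    return timeouts_by_client, timeouts_by_group


if __name__ == "__main__":
    redis = "10.0.0.9"  # hypothetical redis server address
    sample = [
        ("10.0.0.1", redis, encode_command("BRPOP", "queue:a", "queue:b", "2")),
        (redis, "10.0.0.1", NULL_ARRAY),
    ]
    by_client, by_group = summarize_brpop_timeouts(sample, redis)
    print(dict(by_client))
    print(dict(by_group))
```

Grouping by the sorted set of queue keys sidesteps the pod IP to shard mapping problem: clients that poll the same queues land in the same group regardless of their IP. A more careful version would key on IP and port and only count a null array when that connection's previous command was BRPOP.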
Edited by Matt Smiley