Observability of contents of one queue per shard

Currently, we have one queue per Sidekiq worker. In &447 we're looking to change that to one queue per shard. With our current metrics, we will then lose the ability to see which workers have jobs waiting in the queue, which today we can do simply by asking Redis for the length of each per-worker queue.
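For reference, the current per-worker visibility is cheap because it is just a key lookup per queue. A minimal sketch, assuming Sidekiq's standard Redis layout (queue names are members of the "queues" set, and each queue's pending jobs are a Redis list keyed "queue:<name>"):

```ruby
# Sketch: report the length of every Sidekiq queue.
# Assumes Sidekiq's standard Redis layout: queue names in the
# "queues" set, pending jobs in lists keyed "queue:<name>".
def queue_lengths(redis)
  redis.smembers("queues").to_h { |queue| [queue, redis.llen("queue:#{queue}")] }
end

# Usage against a live Redis (requires the `redis` gem):
#   require "redis"
#   queue_lengths(Redis.new).sort_by { |_, n| -n }.each { |q, n| puts "#{q}: #{n}" }
```

Each queue costs one O(1) `LLEN` call, which is why this is viable today; after consolidation, a single shard queue's length no longer tells us which worker is responsible.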

While we don't need this data to generate alerts (queue size is generally sufficient to indicate a problem), we may need it to respond to such alerts, e.g. to find the worker for which a huge batch of jobs was just dumped on the queue, blocking other jobs.

While metrics in Prometheus and graphs in Grafana would be the ideal mechanism (equivalent to what we have today), at a minimum we need tooling (scripts etc.) to enable inspection on the fly during an incident.
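As a sketch of what such incident tooling could look like (a hypothetical helper, not existing tooling): sample the first `limit` jobs in a shard queue and tally which worker classes they belong to. Sidekiq job payloads are JSON objects whose "class" key names the worker.

```ruby
require "json"

# Incident-response sketch: tally worker classes among the head of a queue.
# Only the first `limit` items are fetched, bounding the cost on Redis.
def worker_breakdown(redis, queue, limit: 1_000)
  redis.lrange("queue:#{queue}", 0, limit - 1)
       .map { |payload| JSON.parse(payload)["class"] }
       .tally
       .sort_by { |_, count| -count }
       .to_h
end
```

A breakdown like this would directly answer the "which worker just dumped a batch of jobs?" question during an incident, at the cost of one bounded `LRANGE` per queue.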

Efficiency of Redis CPU usage in obtaining this data is critical, given that the point of epic &447 is reducing CPU saturation.

Proposal

  1. Upgrade gitlab-exporter for sidekiq-redis only. (Unlike in #797 (closed), we do not need to move to Kubernetes here. We 'just' need to upgrade from the ancient version we're running to the current version.)
  2. Enable probe_jobs for Sidekiq. As this uses a Lua script to iterate over the entire working set, we might have performance issues here: if we do, we should stop and reconsider. (Maybe sample the head of that set instead?)
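If iterating the entire working set proves too expensive, the head-sampling fallback mentioned above could itself be done server-side in one round trip. A hedged sketch (not the actual probe_jobs implementation): a Lua script that walks at most `ARGV[1]` items and aggregates worker-class counts in Redis, so only the small summary crosses the network. `cjson` is available to Redis Lua scripts.

```ruby
# Sketch: bounded head-sample of a queue, aggregated server-side in Lua.
SAMPLE_WORKER_COUNTS = <<~LUA
  -- KEYS[1]: queue list key; ARGV[1]: max items to walk
  local jobs = redis.call('LRANGE', KEYS[1], 0, tonumber(ARGV[1]) - 1)
  local counts = {}
  for _, payload in ipairs(jobs) do
    local class = cjson.decode(payload)['class']
    counts[class] = (counts[class] or 0) + 1
  end
  -- Flatten the map to [class1, n1, class2, n2, ...] for the reply.
  local out = {}
  for class, n in pairs(counts) do
    out[#out + 1] = class
    out[#out + 1] = n
  end
  return out
LUA

def sample_worker_counts(redis, queue, limit: 1_000)
  redis.eval(SAMPLE_WORKER_COUNTS, keys: ["queue:#{queue}"], argv: [limit])
       .each_slice(2).to_h
end
```

The `limit` cap bounds the script's runtime regardless of queue length, which matters because Lua scripts block Redis while they run.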

Status

  • Production is upgraded, and probe_jobs is enabled. Change issue: production#4935
  • Overhead of probe_jobs is usually trivial but has occasionally spiked as high as 44 milliseconds. Summary graphs are here: #1029 (comment 611877307)
  • We probably do need to give its queue traversal an upper bound (i.e. walk only the first N items of each queue) before we proceed with consolidating jobs into a single queue per shard.
Edited by Matt Smiley