Root cause analysis for Redis-sidekiq performance degradation

Root cause analysis of production#5148 (closed)

This issue tracks the root cause of the linked incident, which is resolved.

Status

2021-07-13 Current working theory:

The Lua script triggered by gitlab-exporter's probe_jobs started misbehaving (running for longer than expected) from 18:05 until 18:54. The script frequently blocked other Redis calls, so Sidekiq could not process jobs at the usual rate (throughput dropped to about a third of the usual rate).

See https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13776#note_624931998 for the detailed write-up.
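Because Redis runs a Lua script to completion before serving other commands, a long-running script shows up in the Redis slow log as a slow EVAL or EVALSHA call. The sketch below is one possible way to pick such calls out of slow log entries; it is not the tooling used during this investigation. The function name `slow_script_calls` and the 100 ms threshold are hypothetical, and the entry shape (a dict with `duration` in microseconds and `command` as bytes or str) is assumed to match what redis-py's `slowlog_get()` returns.

```python
from typing import Iterable, List


def slow_script_calls(entries: Iterable[dict], min_duration_us: int = 100_000) -> List[dict]:
    """Return slow log entries for EVAL/EVALSHA calls at or above the threshold.

    min_duration_us is a hypothetical threshold (100 ms), not a value
    taken from the incident.
    """
    slow = []
    for entry in entries:
        command = entry["command"]
        # redis-py returns bytes unless the client was created with
        # decode_responses=True, so accept both forms.
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else ""
        # Slow log durations are reported in microseconds.
        if name in ("EVAL", "EVALSHA") and entry["duration"] >= min_duration_us:
            slow.append(entry)
    return slow
```

With a live connection, the entries could come from something like `redis.Redis(host=...).slowlog_get(128)`. Only commands slower than the server's `slowlog-log-slower-than` setting are recorded, so a script that stays under that setting would not appear.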

As a corrective action, we disabled probe_jobs in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/306

Edited by John Jarvis