Root cause analysis for Redis-Sidekiq performance degradation
Root cause analysis of production#5148 (closed)
This issue tracks the root cause of the linked incident, which has been resolved.
Status
2021-07-13 Current working theory:
The Lua script triggered by gitlab-exporter's probe_jobs started misbehaving (running for longer than expected) from 18:05 to 18:54. The script frequently blocked other Redis calls, so Sidekiq could not process jobs at the usual rate; throughput dropped to about a third of normal.
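Why a single slow script can stall Sidekiq is easier to see with a toy model. The sketch below assumes only that Redis executes commands one at a time on its main thread and runs a Lua script atomically, so every command queued behind the script waits until it finishes. The command names and durations are hypothetical and are not measurements from this incident; this is not gitlab-exporter's or Sidekiq's code.

```python
from dataclasses import dataclass


@dataclass
class Command:
    name: str
    arrival_ms: float  # when the client sends the command (hypothetical)
    service_ms: float  # time the server spends executing it (hypothetical)


def simulate(commands):
    """Serve commands one at a time in arrival order, like a single-threaded
    server. Returns {name: latency_ms}, where latency = finish - arrival."""
    clock = 0.0
    latencies = {}
    for cmd in sorted(commands, key=lambda c: c.arrival_ms):
        # A command cannot start before it arrives or before the previous one ends.
        start = max(clock, cmd.arrival_ms)
        clock = start + cmd.service_ms
        latencies[cmd.name] = clock - cmd.arrival_ms
    return latencies


# A hypothetical 500 ms script arrives just before two 1 ms job fetches.
print(simulate([
    Command("lua_script", 0, 500),
    Command("fetch_1", 10, 1),
    Command("fetch_2", 20, 1),
]))
# {'lua_script': 500.0, 'fetch_1': 491.0, 'fetch_2': 482.0}
```

Without the script, each fetch in this model completes in 1 ms; with it, both wait hundreds of milliseconds. Repeated runs of a slow script therefore cut the rate at which workers can fetch jobs.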
See https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13776#note_624931998 for the detailed write-up.
As a corrective action, we disabled probe_jobs in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/306
Edited by John Jarvis