Alert when jobs are not being processed by Sidekiq
Corrective action for:
- incident production#2154 (closed)
- RCA production#2158 (closed)
To avoid being unaware of Sidekiq queues that are not being processed at all, we should implement alerting on low RPS so that we can page the on-call for problems similar to the linked incident. As a first iteration, a very generous low threshold would be good.
@andrewn suggested in Slack:
We alert on any job that maintains a minimum 0.1 rps over the course of the day. If we don’t see it for 6 hours, we alert. Obviously this will be a bit noisy when queues are decommissioned.
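To make the suggested rule concrete, here is a minimal Python sketch of one possible reading of it. It is not the actual alerting implementation. The function name `should_alert` and the input (24 hourly job counts per queue) are assumptions for illustration only. The 0.1 rps and 6 hour values come from the suggestion above. "Maintains 0.1 rps over the course of the day" is interpreted here as the average rate over the last 24 hours.

```python
from typing import Sequence

MIN_DAILY_RPS = 0.1   # minimum average rate for a queue to be monitored (from the suggestion)
SILENCE_HOURS = 6     # alert if nothing was processed for this long (from the suggestion)


def should_alert(hourly_job_counts: Sequence[int]) -> bool:
    """Decide whether a queue should alert.

    hourly_job_counts is a hypothetical input: the number of jobs processed
    in each of the last 24 hours, oldest first.
    """
    if len(hourly_job_counts) != 24:
        raise ValueError("expected 24 hourly samples")

    # Average requests per second over the whole day.
    average_rps = sum(hourly_job_counts) / (24 * 3600)
    is_active_queue = average_rps >= MIN_DAILY_RPS

    # No jobs at all in the most recent SILENCE_HOURS hours.
    recently_silent = sum(hourly_job_counts[-SILENCE_HOURS:]) == 0

    return is_active_queue and recently_silent
```

Under this reading, a decommissioned queue would stop alerting on its own once its 24 hour average drops below the threshold, which bounds the noise mentioned above. A different reading of the rule (for example, a per-hour minimum) would behave differently.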