Corrective action: Workhorse and Load Balancer SLI interdependency for alerts
Summary
During a series of Workhorse degradations, the Workhorse service returned 5xx errors to clients, and every one of those 5xx responses was also counted in the load balancer SLI error rate. As a result, both SLIs alerted on the same underlying failure, creating a heavy load of pages for the EOC to acknowledge and silence.
Related Incident(s)
Originating issue(s): gitlab-com/gl-infra/production#ISSUE_ID
Desired Outcome/Acceptance Criteria
When the Workhorse error rate is alerting, do not alert on the load balancer error rate for the same service.
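The intended suppression can be sketched as follows. This is a minimal illustration only, not the actual alerting configuration: the `Alert` type, its `service` and `component` fields, and the `suppress_dependent_alerts` helper are all hypothetical names introduced here, and a real implementation would more likely live in the alert routing layer than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical representation of a firing error-rate alert."""
    service: str    # assumed label naming the service the SLI belongs to
    component: str  # assumed label: "workhorse" or "loadbalancer"


def suppress_dependent_alerts(firing, source="workhorse", target="loadbalancer"):
    """Drop `target` alerts for any service whose `source` alert is firing.

    Alerts for other services, and the source alerts themselves, pass through.
    """
    # Services for which the upstream (Workhorse) alert is already firing.
    inhibiting_services = {a.service for a in firing if a.component == source}
    return [
        a for a in firing
        if not (a.component == target and a.service in inhibiting_services)
    ]


firing = [
    Alert("web", "workhorse"),
    Alert("web", "loadbalancer"),  # duplicate of the Workhorse failure
    Alert("api", "loadbalancer"),  # unrelated service, must still page
]
remaining = suppress_dependent_alerts(firing)
```

In this sketch the load balancer alert is dropped only when the Workhorse alert is firing for the same service, so a load balancer problem that Workhorse does not explain still pages the EOC.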
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
- Assign a service label