Investigate alerting thresholds for WebPagesServiceWebPagesServerApdexSLOViolationRegional
Summary
Following a recent incident, it was noted that we alert on drops in apdex after 2 minutes.
While such a drop could indicate a problem if the trend continues without recovery, should we be alerting on-call engineers in cases like this one, where apdex recovers very quickly, indicating little or, most likely, no impact to customers?
Alert in question: https://gitlab.com/gitlab-com/runbooks/-/blob/master/thanos-rules/autogenerated-service-level-alerts-web-pages-gprd.yml#L734
Related Incident(s)
Originating issue(s): production#14359
Desired Outcome/Acceptance Criteria
Evaluate the effectiveness of the current alert and determine new threshold values if applicable.
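One possible way to approach the evaluation is to replay historical apdex samples against candidate alert durations and count how often each would have fired. The snippet below is only a rough sketch of that idea: the `count_firings` helper, the one-sample-per-minute series, and the `0.995` threshold are all hypothetical and are not taken from the actual alert rule. It also treats the alert duration as a simple count of consecutive below-threshold samples, which simplifies how the real rule is evaluated.

```python
from typing import List


def count_firings(apdex: List[float], threshold: float, for_samples: int) -> int:
    """Count how many times an alert would fire if it needs `for_samples`
    consecutive samples below `threshold` before firing (simplified model)."""
    firings = 0
    run = 0  # length of the current below-threshold streak
    for value in apdex:
        if value < threshold:
            run += 1
            if run == for_samples:  # fire once per streak, when it gets long enough
                firings += 1
        else:
            run = 0  # recovery resets the streak
    return firings


# Hypothetical data, one sample per minute: a 2-minute dip, then a 6-minute dip.
series = [0.999, 0.99, 0.99, 0.999, 0.999,
          0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.999]

for minutes in (2, 5, 10):
    print(f"for={minutes}m -> {count_firings(series, threshold=0.995, for_samples=minutes)} firing(s)")
```

With this made-up series, a 2-minute duration fires on both dips, a 5-minute duration fires only on the longer one, and a 10-minute duration never fires. Running the same comparison over real apdex history for the incident window would show whether a longer duration suppresses quick recoveries without hiding sustained degradation.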
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')