Investigate alerting thresholds for WebPagesServiceWebPagesServerApdexSLOViolationRegional

Summary

Following a recent incident, it was noted that we alert on drops in apdex after 2 minutes.
While such a drop could indicate a problem for us if the trend continues without recovery, should we be alerting on-call engineers in a pattern like this, where apdex recovers very quickly and there is little or, most likely, no impact to customers?
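
To make the trade-off concrete, here is a minimal sketch of how the hold duration changes which dips would alert. It assumes one apdex sample per minute and simply counts consecutive samples below a threshold, which is a simplification of how a real alerting rule evaluates its hold duration. The function name `alert_fires`, the sample data, and the `0.995` threshold are all hypothetical and are not taken from the actual rule linked below.

```python
from typing import Sequence


def alert_fires(apdex: Sequence[float], threshold: float, for_minutes: int) -> bool:
    """Return True if apdex stays below threshold for at least
    for_minutes consecutive one-minute samples (simplified model)."""
    consecutive = 0
    for value in apdex:
        if value < threshold:
            consecutive += 1
            if consecutive >= for_minutes:
                return True
        else:
            # Recovery resets the count, as a pending alert would clear.
            consecutive = 0
    return False


# Hypothetical data, one sample per minute.
THRESHOLD = 0.995
brief_dip = [0.999, 0.999, 0.95, 0.96, 0.999, 0.999]            # recovers after 2 minutes
sustained_dip = [0.999, 0.95, 0.94, 0.93, 0.92, 0.95, 0.999]    # stays low for 5 minutes

for window in (2, 5):
    print(
        f"for={window}m",
        "brief dip fires:", alert_fires(brief_dip, THRESHOLD, window),
        "sustained dip fires:", alert_fires(sustained_dip, THRESHOLD, window),
    )
```

Under these assumptions, a 2-minute window alerts on both series, while a 5-minute window alerts only on the sustained dip. Replaying real apdex history through a check like this could help compare candidate values before changing the rule.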

Alert in question: https://gitlab.com/gitlab-com/runbooks/-/blob/master/thanos-rules/autogenerated-service-level-alerts-web-pages-gprd.yml#L734

Related Incident(s)

Originating issue(s): production#14359

Desired Outcome/Acceptance Criteria

Evaluate the effectiveness of the current alert and determine new values if applicable.

Associated Services

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose out of
  • Give context for what problem this corrective action is trying to prevent from re-occurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4')