Add registry DB load balancing alerts

Related to DLB: Create new .com alerts based on the new ap... (gitlab-org/container-registry#1408 - closed). Database load balancing is currently only available on staging. having these alerts in place is a pre-requisite for heading to production (pending production readiness review). Underlying metrics/dashboards added in !8112 (merged).

Workflow

  • Copy Alert Playbook Template to /docs/(service-name)/alerts/
  • Name the playbook file the same as the alert (or alert prefix when combining alerts with a different suffix such as Error, Delayed, Missing)
  • Update the runbook link in the associated alert definition to point to the new playbook
  • Run make generate

Overview

  • What does this alert mean?
  • What factors can contribute?
  • What parts of the service are effected?
  • What action is the recipient of this alert expected to take when it fires?

Services

  • All alerts require one or more Service Overview links
  • Team that owns the service

Metrics

  • Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)
  • Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
  • Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
  • Add screenshots of what a dashboard will look like when this alert is firing and when it recovers
  • Are there any specific visuals or messages one should look for in the screenshots?

Alert Behavior

  • Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?
  • Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
  • Show historical trends of the alert firing e.g Kibana dashboard

Severities

  • Guidance for assigning incident severity to this alert
  • Who is likely to be impacted by this cause of this alert?
    • All gitlab.com customers or a subset?
    • Internal customers only?
  • Things to check to determine severity

Verification

  • Prometheus link to query that triggered the alert
  • Additional monitoring dashboards
  • Link to log queries if applicable

Recent changes

  • Links to queries for recent related production change requests
  • Links to queries for recent cookbook or helm MR's
  • How to properly roll back changes

Troubleshooting

  • Basic troubleshooting order
  • Additional dashboards to check
  • Useful scripts or commands

Possible Resolutions

  • Links to past incidents where this alert helped identify an issue with clear resolutions

Dependencies

  • Internal and external dependencies which could potentially cause this alert

Escalation

  • How and when to escalate
  • Slack channels where help is likely to be found:

Definitions

  • Link to the definition of this alert for review and tuning
  • Advice or limitations on how we should or shouldn't tune the alert
  • Link to edit this playbook

Related Links

  • Related alerts link to the alerts folder containing this playbook
  • Related documentation
Edited by João Pereira

Merge request reports

Loading