Add ChefClientErrorCritical Alert Playbook
Overview

- What does this alert mean?
- What factors can contribute?
- What parts of the service are affected?
- What action is the recipient of this alert expected to take when it fires?
Services

- All alerts require one or more Service Overview links
- Team that owns the service
Metrics

- Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds.) See the query sketch after this list.
- Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
- Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
- Add screenshots of what a dashboard will look like when this alert is firing and when it recovers
- Are there any specific visuals or messages one should look for in the screenshots?
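As an illustration of the kind of query this section should point to, here is a minimal Python sketch, assuming the alert is driven by a per-node `chef_client_error` gauge (1 when the last chef-client run failed, 0 otherwise). The metric name and the Prometheus endpoint are assumptions, not confirmed details of this alert; verify them against the metrics catalogue and the alert definition.

```python
import requests

# Assumed values -- confirm against the metrics catalogue and your environment.
PROMETHEUS_URL = "https://prometheus.example.gitlab.net"  # hypothetical endpoint
QUERY = "chef_client_error == 1"                          # assumed underlying gauge

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()

# Each result in the instant vector is one node whose last chef-client run failed.
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(labels.get("fqdn", labels.get("instance", "unknown")), "last chef-client run failed")
```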
Alert Behavior

- Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?
- Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
- Show historical trends of the alert firing, e.g. a Kibana dashboard (a query sketch follows this list)
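To sketch what "historical trends" could look like without a dashboard, the built-in Prometheus `ALERTS` series can be queried for how often this alert has been in the firing state recently. The endpoint below is hypothetical; `ALERTS` and `count_over_time` are standard Prometheus features.

```python
import requests

PROMETHEUS_URL = "https://prometheus.example.gitlab.net"  # hypothetical endpoint
# ALERTS is a synthetic series Prometheus records for every active alert;
# count_over_time gives the number of firing samples per label set over 7 days.
QUERY = 'count_over_time(ALERTS{alertname="ChefClientErrorCritical", alertstate="firing"}[7d])'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    node = result["metric"].get("fqdn", result["metric"].get("instance", "unknown"))
    print(f"{node}: {result['value'][1]} firing samples in the last 7 days")
```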
Severities

- Guidance for assigning incident severity to this alert
- Who is likely to be impacted by the cause of this alert?
  - All gitlab.com customers or a subset?
  - Internal customers only?
- Things to check to determine severity
Verification

- Prometheus link to the query that triggered the alert (a verification sketch follows this list)
- Additional monitoring dashboards
- Link to log queries if applicable
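A quick way to verify what is currently firing, sketched against the standard Alertmanager v2 API. The Alertmanager URL is a placeholder and reaching it may require the right network access.

```python
import requests

ALERTMANAGER_URL = "https://alertmanager.example.gitlab.net"  # hypothetical endpoint

# The v2 API accepts label matchers through the `filter` query parameter.
resp = requests.get(
    f"{ALERTMANAGER_URL}/api/v2/alerts",
    params={"filter": 'alertname="ChefClientErrorCritical"'},
    timeout=10,
)
resp.raise_for_status()

for alert in resp.json():
    labels = alert["labels"]
    print(
        labels.get("fqdn", labels.get("instance", "unknown")),
        alert["status"]["state"],  # "active" or "suppressed" (silenced/inhibited)
        alert["startsAt"],
    )
```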
Recent changes

- Links to queries for recent related production change requests
- Links to queries for recent cookbook or Helm MRs (see the sketch after this list)
- How to properly roll back changes
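For "recent cookbook MRs", one option is the GitLab REST API. A minimal sketch follows; the project path is an assumption (point it at the actual cookbook or Helm repositories), and a `read_api` token is assumed to be needed if the project is not public.

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-com%2Fgl-infra%2Fchef-repo"  # assumed project, URL-encoded path
TOKEN = "glpat-..."                            # assumed: a read_api token for private projects

resp = requests.get(
    f"{GITLAB_API}/projects/{PROJECT}/merge_requests",
    params={"state": "merged", "order_by": "updated_at", "sort": "desc", "per_page": 20},
    headers={"PRIVATE-TOKEN": TOKEN},
    timeout=10,
)
resp.raise_for_status()

for mr in resp.json():
    print(mr["merged_at"], mr["title"], mr["web_url"])
```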
Troubleshooting

- Basic troubleshooting order
- Additional dashboards to check
- Useful scripts or commands
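One example of a "useful script": a node-local check that surfaces recent chef-client errors. The log path and the ERROR/FATAL log format are assumptions and may differ on the actual nodes; adjust before relying on it.

```python
from pathlib import Path

LOG_PATH = Path("/var/log/chef/client.log")  # assumed log location; may require root to read
TAIL_LINES = 400                             # how much recent history to inspect

lines = LOG_PATH.read_text(errors="replace").splitlines()[-TAIL_LINES:]

# Show the most recent error output from chef-client, if any.
errors = [line for line in lines if "ERROR" in line or "FATAL" in line]
if errors:
    print("Recent chef-client errors:")
    for line in errors[-20:]:
        print(" ", line)
else:
    print(f"No ERROR/FATAL lines in the last {TAIL_LINES} log lines.")
```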
Possible Resolutions

- Links to past incidents where this alert helped identify an issue, with clear resolutions
Dependencies

- Internal and external dependencies which could potentially cause this alert
Escalation

- How and when to escalate
- Slack channels where help is likely to be found
Definitions

- Link to the definition of this alert for review and tuning
- Advice or limitations on how we should or shouldn't tune the alert
- Link to edit this playbook
- Update the template used to format this playbook
Related Links

- Related alerts
- Related documentation
Related to gitlab-com/gl-infra/production-engineering#25386 (closed)