Add ChefClientErrorCritical Alert Playbook

Overview

Briefly explain the metric this alert is based on and link to the metrics catalogue. What unit is it measured in? (e.g., CPU usage in percentage, request latency in milliseconds)
Explain the reasoning behind the chosen threshold value for triggering the alert. Is it based on historical data, best practices, or capacity planning?
Describe the expected behavior of the metric under normal conditions. This helps identify situations where the alert might be falsely firing.
Add screenshots of what a dashboard will look like when this alert is firing and when it recovers
Are there any specific visuals or messages one should look for in the screenshots?

Information on silencing the alert (if applicable). When and how can silencing be used? Are there automated silencing rules?
Expected frequency of the alert. Is it a high-volume alert or expected to be rare?
Show historical trends of the alert firing, e.g., a Kibana dashboard
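As an illustration for the silencing guidance above, here is a minimal sketch of a silence payload, assuming the alert is routed through a Prometheus Alertmanager that exposes the v2 silences endpoint. The helper name `build_silence`, the creator, and the comment text are hypothetical; only the alert name comes from this playbook. The sketch only constructs the JSON body and does not send it anywhere.

```python
from datetime import datetime, timedelta, timezone
import json


def build_silence(alertname, hours, created_by, comment, start=None):
    """Build a silence body in the shape Alertmanager's v2 API expects.

    Hypothetical helper: it matches on the alert name only, so it would
    silence every instance of that alert for the given duration.
    """
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the alertname label.
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    # Assumed values for illustration only.
    body = build_silence(
        "ChefClientErrorCritical",
        hours=2,
        created_by="on-call engineer",
        comment="Planned maintenance; link the tracking issue here",
    )
    print(json.dumps(body, indent=2))
```

A time-boxed silence with a comment that links to a tracking issue makes it easier to audit later why the alert was muted. Narrowing the matchers (for example by adding a node or environment label, if such labels exist on this alert) would avoid hiding unrelated failures.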

Guidance for assigning incident severity to this alert
Who is likely to be impacted by the cause of this alert?
- All gitlab.com customers or a subset?
- Internal customers only?
Things to check to determine severity

Links to past incidents where this alert helped identify an issue with clear resolutions