Improve Chef alerting
Problems to Solve
- We only alert the on-call engineer when 10% of a fleet fails to converge. As a result, individual Chef client convergence failures often go unnoticed and are only discovered once they become a blocking problem, in the worst case during an incident, delaying the restoration of service.
- Alerts are currently generated only for GSTG and GPRD. Other environments should also be monitored; this could prevent problems in db-benchmarking, ops, dev, and CustomersDot.
- Low-severity Chef failures are generating alerts, but they go to the #alerts channel, which is noisy and not used to monitor system health.
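To make the first problem concrete, the current fleet-level threshold presumably looks something like the following Prometheus expression. This is a sketch only; the metric name `chef_client_error` and the `fleet` label are assumptions for illustration, not the actual names used in our rules.

```yaml
# Hypothetical sketch of the existing fleet-level rule: it only fires
# once more than 10% of a fleet is failing to converge, so a single
# failing node never triggers an alert.
- alert: ChefFleetConvergenceFailure
  expr: >
    sum by (fleet) (chef_client_error)
      /
    count by (fleet) (chef_client_error)
      > 0.10
  for: 1h
```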
Proposed Actions
- Create a Fleet Management channel specifically for Chef (and possibly other services like GKE) to be monitored by subject matter experts.
- Update the Chef alert rules in the runbooks repository to monitor all Chef environments, not just GSTG and GPRD.
- Update Chef alerts to route to the new channel, and potentially remove the alert for EOCs (engineers on call). Major provisioning issues should be caught earlier by the SME team, which can escalate by declaring an incident if needed.
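The rule and routing changes above could look roughly like the following sketch. The metric name (`chef_client_error`), label names (`env`, `channel`, `fqdn`), and the receiver/channel names are assumptions for illustration, not the actual identifiers in the runbooks repository.

```yaml
# Hypothetical per-node rule: fires for any single node that fails to
# converge, in every environment rather than only gstg/gprd.
- alert: ChefClientError
  expr: chef_client_error == 1   # metric name assumed for illustration
  for: 2h
  labels:
    severity: s4
    channel: fleet-management     # picked up by the Alertmanager route below
  annotations:
    title: chef-client is failing on {{ $labels.fqdn }}
    description: >
      chef-client has not converged on {{ $labels.fqdn }}
      (environment {{ $labels.env }}) for over 2 hours.
```

A matching Alertmanager route (again hypothetical names) would then send these alerts to the new Fleet Management channel instead of #alerts:

```yaml
route:
  routes:
    - match:
        channel: fleet-management
      receiver: slack_fleet_management
receivers:
  - name: slack_fleet_management
    slack_configs:
      - channel: '#fleet-management'
```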
Additional Ideas
- Add Chef as a service in the service catalog.