2020-02-12: Chef client failures have reached critical levels
Summary
Chef client failures have reached critical levels.
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-12
- 16:45 - Yorick Peterse mentions in the #production Slack channel: "We are promoting canary to production, now that canary is in a healthy state again"
- 17:45 - Alert triggered: Chef client failures have reached critical levels
- 17:54 - No reports of any service outages or degradation
- 18:13 - Alert triggered: Bad canary? The `cny` stage of the `gitaly` service has an error-ratio exceeding SLO, but the `main` stage does not.
- 18:18 - Alert re-triggered: Chef client failures have reached critical levels
- 18:18 - Alert resolved: Bad canary? The `cny` stage of the `gitaly` service has an error-ratio exceeding SLO, but the `main` stage does not.
- 18:20 - From the #alerts-general Slack channel: Anomaly detection: The `stackdriver` service (`main` stage) is receiving fewer requests than normal
- 18:52 - Mayra reports failed deployer job https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/905707. This appears to be due to `No space left on device` errors for at least `api-06-sv-gprd.c.gitlab-production.internal`, which is a known issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9193
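The `No space left on device` errors can be triaged on the affected node with standard tools. A minimal sketch, assuming shell access to the host; the paths checked are illustrative guesses, not taken from the incident:

```shell
# Hedged triage sketch for "No space left on device" (assumed steps,
# not the commands actually run during this incident).

# Show filesystem usage; a volume at 100% Use% confirms the error.
df -h

# Check inode usage too: "No space left on device" can also mean the
# filesystem ran out of inodes rather than bytes.
df -i

# Rank likely culprits by size (logs and caches are common offenders
# on Chef-managed hosts; directories are illustrative).
du -sh /var/log /var/cache 2>/dev/null | sort -rh
```

If a volume is full, the usual next step is rotating or truncating the largest logs before re-running the deployer job.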
Additional details
This chart indicates that Chef client failures began at 16:45 GMT.