2020-02-12: Chef client failures have reached critical levels
Summary
Chef client failures have reached critical levels.
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-12
- 16:45 - Yorick Peterse mentions in the #production Slack channel: "We are promoting canary to production, now that canary is in a healthy state again"
- 17:45 - Alert triggered: Chef client failures have reached critical levels
- 17:54 - No reports of any service outages or degradation
- 18:13 - Alert triggered: Bad canary? The `cny` stage of the `gitaly` service has an error-ratio exceeding SLO, but the `main` stage does not.
- 18:18 - Alert re-triggered: Chef client failures have reached critical levels
- 18:18 - Alert resolved: Bad canary? The `cny` stage of the `gitaly` service has an error-ratio exceeding SLO, but the `main` stage does not.
- 18:20 - From the #alerts-general Slack channel: Anomaly detection: The `stackdriver` service (`main` stage) is receiving fewer requests than normal
- 18:52 - Mayra reports failed deployer job https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/905707. This appears to be due to `No space left on device` errors for at least `api-06-sv-gprd.c.gitlab-production.internal`, which is a known issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9193
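The `No space left on device` errors can be triaged on the affected node with standard tools. A minimal sketch, assuming shell access to the host; the paths checked are illustrative guesses, not taken from the incident:

```shell
# Hedged triage sketch for "No space left on device" (assumed steps,
# not the commands actually run during this incident).

# Show filesystem usage; a volume at 100% Use% confirms the error.
df -h

# Check inode usage too: "No space left on device" can also mean the
# filesystem ran out of inodes rather than bytes.
df -i

# Rank likely culprits by size (logs and caches are common offenders
# on Chef-managed hosts; directories are illustrative).
du -sh /var/log /var/cache 2>/dev/null | sort -rh
```

If a volume is full, the usual next step is rotating or truncating the largest logs before re-running the deployer job.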
Additional details
This chart indicates that Chef client failures began at 16:45 GMT.