2020-08-04: Alertmanager failing to send alerts
Summary
Since sometime after 2020-07-28 16:17 UTC (exact time still unclear), Alertmanager has not successfully sent alerts anywhere.
The error is of the form:

```
cancelling notify retry for "pagerduty" due to unrecoverable error: "note": failed to template "{{ template \"slack.text\" . }}": template: :1:12: executing "" at <{{template "slack.text" .}}>: template "slack.text" not defined
```
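For context, this is the shape of receiver configuration involved. A minimal sketch, assuming a hypothetical receiver name and channel (illustrative only, not our exact config):

```yaml
# Illustrative alertmanager.yml fragment: a Slack receiver that calls the
# shared "slack.text" template. If no file matched by the `templates` glob
# defines "slack.text", every notification attempt fails with the error above.
receivers:
  - name: slack_alerts              # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

# The glob must actually match the .tmpl files on disk for the named
# templates to be defined at all.
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```

While "slack.text" is undefined, every notification through such a receiver fails, which is why delivery stopped entirely rather than merely degrading.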
Timeline
All times UTC.
2020-07-13
- 13:49 - Errors started appearing in the GKE logs. It is not clear why, as there was no reload immediately before. The timing of gitlab-com/runbooks!2521 (merged) is highly coincidental, but it may just have triggered a latent issue.
2020-07-28
- 09:14 - Last PagerDuty alert received
- 16:17 - Last notification received in #alerts
2020-07-29
- 08:52 - The old Alertmanager instances (which were probably the ones successfully sending alerts) were removed via manual application of https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1951.
2020-08-04
- 00:09 - cmiskell notices no alerts in #alerts for 6 days.
- 00:48 - cmiskell declares incident in Slack using the `/incident declare` command.
- 01:33 - Alerts start flowing again after the configuration is corrected by gitlab-com/runbooks!2632 (merged)
Incident Review
Summary
During the migration of Alertmanager (henceforth AM) to Kubernetes (k8s), differences in how configuration files are provided (compared to traditional VMs) introduced a templating configuration error that prevented the k8s instances of AM from sending alerts to Slack or PagerDuty. This was masked during the interim period, as the non-k8s AM instances were still sending the alerts, but when those instances were shut down at 2020-07-29 08:52 UTC, we ceased receiving any alerts from AM to PagerDuty or Slack. The error was only discovered at 2020-08-04 00:09 UTC, when the incoming on-call engineer checked the (non-paging) #alerts Slack channel and noticed that there had been no messages there since 2020-07-29, although the channel is usually quite active with various low-grade noise. Investigation quickly found that AM was failing; a corrected configuration was deployed at 01:33 UTC and alerts started being received again.
- Service(s) affected: Monitoring (Alert Manager)
- Team attribution: @gitlab-com/gl-infra/sre-observability
- Minutes downtime or degradation: 8201 minutes (5 days, 16 hours, 41 minutes: from the removal of the old instances at 2020-07-29 08:52 UTC to the fix at 2020-08-04 01:33 UTC)
Metrics
https://gitlab.com/gitlab-com/gl-infra/production/uploads/e8e3082b1f415bb1a5c85723862a4f75/image.png
Customer Impact
- Who was impacted by this incident? GitLab Reliability teams, losing primary alerts for other incidents/issues in the infrastructure.
- What was the customer experience during the incident? No alerts were received in PagerDuty (the most important ones), nor were lower-grade Slack-based notifications.
- How many customers were affected? One, being GitLab itself.
Incident Response Analysis
- How was the event detected?
- Manual checking of known alerting channels
- How could detection time be improved?
- Review of the logs from the new AM containers during the ramp-up phase (errors were reported consistently from as far back as 2020-07-13)
- Explicit testing of the critical alerting paths (particularly PagerDuty) after the shutdown of the old nodes, to verify that it was working
- How did we reach the point where we knew how to mitigate the impact?
- Little to add beyond reviewing the logs, identifying the error message, and then applying standard debugging techniques to figure out why AM couldn't see the template configuration it was expecting
- How could time to mitigation be improved?
- Slightly more familiarity with the new deployment arrangements would have helped; we were surprised at various points when configuration was deployed (by pipelines on ops, as it turned out) and weren't sure where to watch to know when to test/verify.
Post Incident Analysis
- How was the root cause diagnosed?
- Log review, and basic debugging. The failure was very obvious once we knew there was an issue, and the correction was just a matter of standard review of configuration vs documentation.
- How could time to diagnosis be improved?
- Having the logs available in Elasticsearch, rather than just Stackdriver, and having that location be well known
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- No
- Was this incident triggered by a change (deployment of code or change to infrastructure)?
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10546 and the implementation merge request https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1951
5 Whys
- Alerts were not sent to PagerDuty or Slack because the AM configuration was broken. Why?
- Template variables like `slack.title` were referenced in the alert configuration, but the template files defining these (`gitlab.tmpl` and `slack.tmpl`) were not found by the configuration.
- Why could AM not find the template configuration?
- On a traditional VM, the `.tmpl` files live in a `templates` subdirectory below `alertmanager.yml` and the glob is something like `/etc/alertmanager/templates/*.tmpl`, but this was not the location of the files in the k8s containers. When the AM configuration was converted to work with k8s, secrets were used, which means the `.yml` and `.tmpl` files all end up in the same directory. Further, the actual base configuration directory created by the charts is `/etc/alertmanager/config` (see the sketch after the 5 Whys).
- Why was this not noticed during the mixed VM + k8s ramp-up period?
- Logs were not reviewed.
- The two sets of AMs (VMs and k8s) were not fully meshed, which violates proper Prometheus practice. If they had been, we would have received alerts about the notification errors on the k8s AMs, from the non-k8s AMs.
- Why were the logs not checked?
- Alertmanager is generally silent unless something is wrong, so lacking any other evidence that something was awry, there was no driver to check the logs even cursorily. Additionally, the `snitch` heartbeat checks were working, because they are simpler (just a URL) and do not use the problematic templates; this provided false comfort that alerting in general was working.
- Why were the two sets of AMs not fully meshed?
- @bjk-gitlab - can you elaborate on this? I guess because it was difficult/challenging in some way, but I don't know for sure and you likely have the answer to hand.
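To make the path mismatch concrete, here is a hedged before/after sketch of the `templates` glob. The paths are taken from the analysis above; the exact chart layout is an assumption, and this is not the literal diff from gitlab-com/runbooks!2632:

```yaml
# VM layout: alertmanager.yml loads the .tmpl files from a subdirectory.
templates:
  - '/etc/alertmanager/templates/*.tmpl'
---
# k8s layout: the secret mounts alertmanager.yml, gitlab.tmpl, and slack.tmpl
# into a single directory, and the chart's base configuration directory is
# /etc/alertmanager/config, so the glob must point there instead.
templates:
  - '/etc/alertmanager/config/*.tmpl'
```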
Lessons Learned
- Logs for new services (or newly relocated services) should always be reviewed, at least briefly, particularly when the deployment mechanism is significantly different from the prior one.
- After significant changes, all major/critical functionality should be explicitly tested, e.g. in this case, that alerts to PagerDuty were working. We have readiness reviews (https://gitlab.com/gitlab-com/gl-infra/readiness/-/issues) and one of these might have afforded an opportunity for other SREs to ask the right questions about testing.
- Quis custodiet ipsos custodes? Who Watches the Watchers? We need additional alerting to ensure we are alerted when there are failures in alerting, which can be challenging.
- Changes to Alerting infrastructure should follow the Change Management process and leverage a Change Management issue.
Corrective Actions
Implemented
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11039 - a DeadMansSnitch (DMS) on errors from AM. It is an inverse check (it alerts continuously while there are no errors and stops alerting when errors appear), so it has a high probability of working in many failure modes, and it complements the existing "alive" heartbeat snitch; a sketch of the idea follows this list.
- Explicitly identify the criticality of Alerting changes on the Change Management handbook page.
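As a sketch of the inverse-snitch idea (group, alert, and duration values here are illustrative assumptions; the real rule is tracked in the issue above):

```yaml
# A Prometheus rule that fires continuously while AM reports no failed
# notifications, routed to DeadMansSnitch (DMS). If notification errors
# appear, or alert delivery breaks entirely, the alert stops arriving at
# the snitch and DMS pages us.
groups:
  - name: alertmanager-self-monitoring   # hypothetical group name
    rules:
      - alert: AlertmanagerNotificationsErrorFree
        expr: sum(rate(alertmanager_notifications_failed_total[10m])) == 0
        for: 5m
        annotations:
          description: Fires while Alertmanager reports no failed notifications.
```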
In progress
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11041 - PagerDuty to DMS snitch so we can tell, from DMS, if PagerDuty is down/failing to notify
- ETA: unknown
- DRI: @bjk-gitlab
Proposed
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11065 - Logs for AM in k8s should go to Elasticsearch, for more ergonomic access. This will need to be documented and socialized well within the SRE teams.
- If the logs are in Elasticsearch, we could also add an elasticwatch for errors from AM (there should be none under normal circumstances), which sends messages directly to Slack, bypassing the Prometheus and PagerDuty infrastructure entirely. It's a "positive hit detection" approach, so it wouldn't notice if logging were broken, but it is another path for alerts.
- A readiness review document, specifically owned by a team member from a different team (progress this)