2020-08-04: Alertmanager failing to send alerts
Summary
Since sometime after 2020-07-28 16:17 UTC (exact time still unclear), Alertmanager has not successfully sent alerts anywhere.
The error is of the form:

```
cancelling notify retry for "pagerduty" due to unrecoverable error: "note": failed to template "{{ template \"slack.text\" . }}": template: :1:12: executing "" at <{{template "slack.text" .}}>: template "slack.text" not defined
```
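For context, this is the shape of receiver configuration involved. A minimal sketch, assuming a hypothetical receiver name and channel (illustrative only, not our exact config):

```yaml
# Illustrative alertmanager.yml fragment: a Slack receiver that calls the
# shared "slack.text" template. If no file matched by the `templates` glob
# defines "slack.text", every notification attempt fails with the error above.
receivers:
  - name: slack_alerts              # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'

# The glob must actually match the .tmpl files on disk for the named
# templates to be defined at all.
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```

While "slack.text" is undefined, every notification through such a receiver fails, which is why delivery stopped entirely rather than merely degrading.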
Timeline
All times UTC.
2020-07-13
- 13:49 - Errors started appearing in the GKE logs. It is not clear why, as there was no reload immediately before. The timing of gitlab-com/runbooks!2521 (merged) is highly coincidental, but it may just have triggered a latent issue.
2020-07-28
- 09:14 - Last PagerDuty alert received
- 16:17 - Last notification received in #alerts
2020-07-29
- 08:52 - The old Alertmanager instances (which were probably the ones successfully sending alerts) were removed via manual application of https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1951.
2020-08-04
- 00:09 - cmiskell notices no alerts in #alerts for 6 days.
- 00:48 - cmiskell declares incident in Slack using the `/incident declare` command.
- 01:33 - Alerts start flowing again after the configuration is corrected by gitlab-com/runbooks!2632 (merged)
Incident Review
Summary
During the migration of Alertmanager (henceforth AM) to Kubernetes (k8s), differences in how configuration files are provided (compared to traditional VMs) introduced a templating configuration error that prevented the k8s instances of AM from sending alerts to Slack or PagerDuty. This was masked during the interim period, as the non-k8s AM instances were still sending the alerts, but when those instances were shut down at 2020-07-29 08:52 UTC, we ceased receiving any alerts from AM to PagerDuty or Slack. The error was only discovered at 2020-08-04 00:09 UTC, when the incoming on-call engineer checked the (non-paging) #alerts Slack channel and noticed that there had been no messages there since 2020-07-29, although the channel is usually quite active with various low-grade noise. Investigation quickly found that AM was failing; a corrected configuration was deployed at 01:33 UTC and alerts started being received again.
- Service(s) affected: Monitoring (Alert Manager)
- Team attribution: @gitlab-com/gl-infra/sre-observability
- Minutes downtime or degradation: 8201 minutes (5 days, 16 hours, 41 minutes: from the removal of the old instances at 2020-07-29 08:52 UTC to the fix at 2020-08-04 01:33 UTC)
Metrics
https://gitlab.com/gitlab-com/gl-infra/production/uploads/e8e3082b1f415bb1a5c85723862a4f75/image.png
Customer Impact
- Who was impacted by this incident? GitLab Reliability teams, losing primary alerts for other incidents/issues in the infrastructure.
- What was the customer experience during the incident? No alerts were received in PagerDuty (the most important ones), nor were lower-grade Slack-based notifications.
- How many customers were affected? One, being GitLab itself.
Incident Response Analysis
- How was the event detected?
- Manual checking of known alerting channels
- How could detection time be improved?
- Review of the logs from the new AM containers during the ramp-up phase (errors were reported consistently from as far back as 2020-07-13)
- Explicit testing of the critical alerting paths (particularly PagerDuty) after the shutdown of the old nodes, to verify that it was working
- How did we reach the point where we knew how to mitigate the impact?
- Little to add beyond reviewing the logs, identifying the error message, and then applying standard debugging techniques to figure out why AM couldn't see the template configuration it was expecting
- How could time to mitigation be improved?
- Slightly more familiarity with the new deployment arrangements would have helped; we were surprised at various points when configuration was deployed (by pipelines on ops, as it turned out) and weren't sure where to watch to know when to test/verify.
Post Incident Analysis
- How was the root cause diagnosed?
- Log review, and basic debugging. The failure was very obvious once we knew there was an issue, and the correction was just a matter of standard review of configuration vs documentation.
- How could time to diagnosis be improved?
- Having the logs available in Elasticsearch, rather than just Stackdriver, and having that location be well known
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- No
- Was this incident triggered by a change (deployment of code or change to infrastructure)?
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10546 and the implementation merge request https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1951
5 Whys
- Alerts were not sent to PagerDuty or Slack because the AM configuration was broken. Why?
- Template variables like `slack.title` were referenced in the alert configuration, but the template files defining these (`gitlab.tmpl` and `slack.tmpl`) were not found by the configuration.
- Why could AM not find the template configuration?
- On a traditional VM, the `.tmpl` files live in a `templates` subdirectory below `alertmanager.yml` and the glob is something like `/etc/alertmanager/templates/*.tmpl`, but this was not the location of the files in the k8s containers. When the AM configuration was converted to work with k8s, secrets were used, which means the `.yml` and `.tmpl` files all end up in the same directory. Further, the actual base configuration directory created by the charts is `/etc/alertmanager/config` (see the sketch after the 5 Whys).
- Why was this not noticed during the mixed VM + k8s ramp-up period?
- Logs were not reviewed.
- The two sets of AMs (VMs and k8s) were not fully meshed, which violates proper Prometheus practice. If they had been, we would have received alerts about the notification errors on the k8s AMs, from the non-k8s AMs.
- Why were the logs not checked?
- Alertmanager is generally silent unless something is wrong, so lacking any other evidence that something was awry, there was no driver to check the logs even cursorily. Additionally, the `snitch` heartbeat checks were working, because they are simpler (just a URL) and do not use the problematic templates; this provided false comfort that alerting in general was working.
- Why were the two sets of AMs not fully meshed?
- @bjk-gitlab - can you elaborate on this? I guess because it was difficult/challenging in some way, but I don't know for sure and you likely have the answer to hand.
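To make the path mismatch concrete, here is a hedged before/after sketch of the `templates` glob. The paths are taken from the analysis above; the exact chart layout is an assumption, and this is not the literal diff from gitlab-com/runbooks!2632:

```yaml
# VM layout: alertmanager.yml loads the .tmpl files from a subdirectory.
templates:
  - '/etc/alertmanager/templates/*.tmpl'
---
# k8s layout: the secret mounts alertmanager.yml, gitlab.tmpl, and slack.tmpl
# into a single directory, and the chart's base configuration directory is
# /etc/alertmanager/config, so the glob must point there instead.
templates:
  - '/etc/alertmanager/config/*.tmpl'
```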
Lessons Learned
- Logs for new services (or newly relocated services) should always be reviewed, at least briefly, particularly when the deployment mechanism is significantly different from the prior one.
- After significant changes, all major/critical functionality should be explicitly tested, e.g. in this case, that alerts to PagerDuty were working. We have readiness reviews (https://gitlab.com/gitlab-com/gl-infra/readiness/-/issues) and one of these might have afforded an opportunity for other SREs to ask the right questions about testing.
- Quis custodiet ipsos custodes? Who Watches the Watchers? We need additional alerting to ensure we are alerted when there are failures in alerting, which can be challenging.
- Changes to Alerting infrastructure should follow the Change Management process and leverage a Change Management issue.
Corrective Actions
Implemented
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11039 - a DeadMansSnitch (DMS) on errors from AM. It is an inverse check (it alerts continuously while there are no errors and stops alerting when errors appear), so it has a high probability of working in many failure modes, and it complements the existing "alive" heartbeat snitch; a sketch of the idea follows this list.
- Explicitly identify the criticality of Alerting changes on the Change Management handbook page.
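As a sketch of the inverse-snitch idea (group, alert, and duration values here are illustrative assumptions; the real rule is tracked in the issue above):

```yaml
# A Prometheus rule that fires continuously while AM reports no failed
# notifications, routed to DeadMansSnitch (DMS). If notification errors
# appear, or alert delivery breaks entirely, the alert stops arriving at
# the snitch and DMS pages us.
groups:
  - name: alertmanager-self-monitoring   # hypothetical group name
    rules:
      - alert: AlertmanagerNotificationsErrorFree
        expr: sum(rate(alertmanager_notifications_failed_total[10m])) == 0
        for: 5m
        annotations:
          description: Fires while Alertmanager reports no failed notifications.
```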
In progress
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11041 - PagerDuty to DMS snitch so we can tell, from DMS, if PagerDuty is down/failing to notify
- ETA: unknown
- DRI: @bjk-gitlab
Proposed
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11065 - Logs for AM in k8s should go to Elasticsearch, for more ergonomic access. This will need to be documented and socialized well within the SRE teams.
- If the logs are in Elasticsearch, we could also add an elasticwatch for errors from AM (there should be none under normal circumstances), which sends messages directly to Slack, bypassing the Prometheus and PagerDuty infrastructure entirely. It's a "positive hit detection" approach, so it wouldn't notice if logging were broken, but it is another path for alerts.
- A readiness review document, specifically owned by a team member from a different team (progress this)