2020-03-09 Alertmanager is failing to send notifications
Summary
We received an alert that Alertmanager is failing to send notifications. The /var/log/prometheus/alertmanager/current file on alerts-02-inf-gprd.c.GitLab-production.internal contains lines like this:
2020-03-09_23:39:56.56926 level=error ts=2020-03-09T23:39:56.566Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge" context_err="context deadline exceeded"
2020-03-09_23:39:56.57010 level=error ts=2020-03-09T23:39:56.566Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge"
2020-03-09_23:39:58.87870 level=error ts=2020-03-09T23:39:58.878Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 408: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge" context_err=null
2020-03-09_23:39:58.88037 level=error ts=2020-03-09T23:39:58.880Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 408: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge"
2020-03-09_23:40:26.57027 level=error ts=2020-03-09T23:40:26.570Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge" context_err="context deadline exceeded"
2020-03-09_23:40:26.57198 level=error ts=2020-03-09T23:40:26.570Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge"
2020-03-09_23:40:45.67498 level=error ts=2020-03-09T23:40:45.674Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 422: https://gitlab.com/gitlab-com/gl-infra/infrastructure/prometheus/alerts/notify.json" context_err=null
2020-03-09_23:40:45.67692 level=error ts=2020-03-09T23:40:45.676Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 422: https://gitlab.com/gitlab-com/gl-infra/infrastructure/prometheus/alerts/notify.json"
2020-03-09_23:43:12.70933 level=error ts=2020-03-09T23:43:12.709Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 422: https://gitlab.com/gitlab-com/gl-infra/infrastructure/prometheus/alerts/notify.json" context_err=null
2020-03-09_23:43:12.71140 level=error ts=2020-03-09T23:43:12.711Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 422: https://gitlab.com/gitlab-com/gl-infra/infrastructure/prometheus/alerts/notify.json"
2020-03-09_23:45:23.22033 level=error ts=2020-03-09T23:45:23.220Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge" context_err="context deadline exceeded"
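To see at a glance which webhook receivers are failing and with which status codes, the excerpt above can be tallied per endpoint. The following is a minimal sketch, not part of the original investigation: the function name `tally_failures` and the regular expression are illustrative assumptions, and the sample lines are copied from the excerpt above. It counts only the `dispatch.go` "Notify for alerts failed" lines, because `notify.go` logs the same failure a second time.

```python
import re
from collections import Counter

# Assumed pattern: pulls the HTTP status code and the webhook URL out of the err="..." field.
STATUS_RE = re.compile(r'unexpected status code (\d{3}): (https?://[^\s"]+)')

# Three lines copied verbatim from the log excerpt above.
SAMPLE = r'''
2020-03-09_23:39:56.56926 level=error ts=2020-03-09T23:39:56.566Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge" context_err="context deadline exceeded"
2020-03-09_23:39:56.57010 level=error ts=2020-03-09T23:39:56.566Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="unexpected status code 500: https://us-central1-gitlab-infra-automation.cloudfunctions.net/alertManagerBridge"
2020-03-09_23:40:45.67692 level=error ts=2020-03-09T23:40:45.676Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"webhook\" due to unrecoverable error: unexpected status code 422: https://gitlab.com/gitlab-com/gl-infra/infrastructure/prometheus/alerts/notify.json"
'''


def tally_failures(lines):
    """Count (status code, webhook URL) pairs across failed notification attempts."""
    counts = Counter()
    for line in lines:
        # Skip the notify.go line, which reports the same failure as dispatch.go.
        if 'msg="Notify for alerts failed"' not in line:
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (status, url), count in sorted(tally_failures(SAMPLE.splitlines()).items()):
        print(f"{count:3d}  {status}  {url}")
```

To run it against the live log rather than the sample, the open file handle for /var/log/prometheus/alertmanager/current can be passed to `tally_failures` directly, since the function accepts any iterable of lines.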
The graph shows both gprd and ops displaying the same behavior.
Timeline
All times UTC.
2020-03-09
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)