Prometheus alerts delivered from alertmanager into GitLab issues are silently being dropped

See gitlab-com/runbooks!2592 (merged) and gitlab-com/gl-infra/production#2451 (closed) for more details.

GitLab.com's AlertManager infrastructure delivers some alerts to GitLab.com issues, but these alerts are being silently dropped.

On Jul 23, 2020 @ 00:40:07.791, AlertManager delivered a webhook alert to GitLab.com:

Log entry (while it lasts): https://log.gprd.gitlab.net/app/kibana#/discover/doc/AW5F1e45qthdGjPJueGO/pubsub-rails-inf-gprd-003224?id=lWQceXMBOELd9C8V9tGa

The server responded with HTTP 200.

The following params were delivered to GitLab.com:

```
{
  "key": "receiver",
  "value": "issue:gitlab\\.com/gitlab-com/gl-infra/production"
},
{
  "key": "status",
  "value": "firing"
},
{
  "key": "alerts",
  "value": "[{\"status\"=>\"firing\", \"labels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"annotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"startsAt\"=>\"2020-07-23T00:30:00.587237764Z\", \"endsAt\"=>\"0001-01-01T00:00:00Z\", \"generatorURL\"=>\"https://prometheus.gprd.gitlab.net/graph?g0.expr=probe_ssl_earliest_cert_expiry%7Bjob%3D%22blackbox%22%7D+-+time%28%29+%3C+14+%2A+86400&g0.tab=1\", \"fingerprint\"=>\"1f00c90951546e3b\"}]"
},
{
  "key": "groupLabels",
  "value": "{\"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}"
},
{
  "key": "commonLabels",
  "value": "{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}"
},
{
  "key": "commonAnnotations",
  "value": "{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}"
},
{
  "key": "externalURL",
  "value": "http://alerts-01-inf-ops:9093"
},
{
  "key": "version",
  "value": "4"
},
{
  "key": "groupKey",
  "value": "{}/{env=\"gprd\",pager=\"issue\",project=\"gitlab.com/gitlab-com/gl-infra/production\"}:{alertname=\"SSLCertExpiresSoon\", env=\"gprd\", stage=\"main\", tier=\"sv\", type=\"blackbox\"}"
},
{
  "key": "namespace_id",
  "value": "gitlab-com/gl-infra"
},
{
  "key": "project_id",
  "value": "production"
},
{
  "key": "alert",
  "value": "{\"receiver\"=>\"issue:gitlab\\\\.com/gitlab-com/gl-infra/production\", \"status\"=>\"firing\", \"alerts\"=>[{\"status\"=>\"firing\", \"labels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"annotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"startsAt\"=>\"2020-07-23T00:30:00.587237764Z\", \"endsAt\"=>\"0001-01-01T00:00:00Z\", \"generatorURL\"=>\"https://prometheus.gprd.gitlab.net/graph?g0.expr=probe_ssl_earliest_cert_expiry%7Bjob%3D%22blackbox%22%7D+-+time%28%29+%3C+14+%2A+86400&g0.tab=1\", \"fingerprint\"=>\"1f00c90951546e3b\"}], \"groupLabels\"=>{\"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"commonLabels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"commonAnnotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"externalURL\"=>\"http://alerts-01-inf-ops:9093\", \"version\"=>\"4\", \"groupKey\"=>\"{}/{env=\\\"gprd\\\",pager=\\\"issue\\\",project=\\\"gitlab.com/gitlab-com/gl-infra/production\\\"}:{alertname=\\\"SSLCertExpiresSoon\\\", env=\\\"gprd\\\", stage=\\\"main\\\", tier=\\\"sv\\\", type=\\\"blackbox\\\"}\"}"
}
```
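Note that the `=>` separators in the payload above are Rails' `Hash#inspect` rendering of the parameters as logged, not raw JSON. As a rough sketch (assuming no string value contains the `=>` sequence, which holds for these label sets), the dumps can be converted back into structured data for analysis:

```python
import json

def ruby_hash_to_dict(dump: str) -> dict:
    """Convert a Rails Hash#inspect dump (e.g. '{"a"=>"b"}') to a dict.

    Naive sketch: assumes no string value contains the '=>' sequence.
    """
    return json.loads(dump.replace("=>", ":"))

# Example using the groupLabels value from the log entry above.
group_labels = ruby_hash_to_dict(
    '{"alertname"=>"SSLCertExpiresSoon", "env"=>"gprd", '
    '"stage"=>"main", "tier"=>"sv", "type"=>"blackbox"}'
)
print(group_labels["alertname"])  # SSLCertExpiresSoon
```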

The webhook was successfully delivered, but did not create an issue.

The AlertManager configuration for this receiver is as follows:

```yaml
- name: issue:gitlab.com/gitlab-com/gl-infra/production
  webhook_configs:
  - http_config:
      bearer_token: SECRET
    send_resolved: true
    url: https://gitlab.com/gitlab-com/gl-infra/production/prometheus/alerts/notify.json
```
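For debugging, a minimal version-4 AlertManager webhook body can be replayed against the endpoint by hand to see whether an issue is created. This is only a sketch: the token value, alert name, and helper names are placeholders, and the URL is the one from the receiver config:

```python
import json
import urllib.request

# Endpoint from the receiver configuration above.
NOTIFY_URL = ("https://gitlab.com/gitlab-com/gl-infra/production"
              "/prometheus/alerts/notify.json")

def build_payload(alertname: str) -> dict:
    """Minimal version-4 AlertManager webhook body with one firing alert."""
    return {
        "version": "4",
        "status": "firing",
        "receiver": "issue:gitlab\\.com/gitlab-com/gl-infra/production",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": alertname, "pager": "issue"},
            "annotations": {"title": "manual replay test"},
        }],
    }

def replay(token: str) -> int:
    # POST the payload with the same bearer token AlertManager uses.
    req = urllib.request.Request(
        NOTIFY_URL,
        data=json.dumps(build_payload("SSLCertExpiresSoon")).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # a 200 here, with no issue appearing, reproduces the bug
```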

In the case of the GitLab.com alert that was lost, we could easily have missed an SSL certificate renewal had it not been noticed through other means. It is critical to the availability of GitLab.com that our alerting infrastructure works as expected.

Therefore, I'm marking this as ~P2 ~S2.

cc @crystalpoole @sarahwaldner @bjk-gitlab