feat: group alerts by service/type

Steve Xuereb requested to merge feat/groups-alerts-by-service into master

This attempt "failed"; the second attempt is in !4715 (merged).

What

Group alerts by service

Group alerts by the type label using Alertmanager grouping, so that only one alert is sent per service.

Alertmanager group example

alertmanager-grouping.excalidraw
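
To make the grouping idea concrete, here is a minimal Python sketch of what grouping by the type label does conceptually. It is not Alertmanager code or the configuration from this merge request; the alert names, label values, and the group_alerts helper are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Hypothetical alert payloads; only the `type` label comes from this MR,
# the alert names and values are made up for illustration.
alerts = [
    {"alertname": "ApdexSLOViolation", "type": "web"},
    {"alertname": "ErrorSLOViolation", "type": "web"},
    {"alertname": "ErrorSLOViolation", "type": "api"},
]


def group_alerts(alerts, group_by=("type",)):
    """Bucket alerts by the values of the group_by labels.

    Each bucket stands for one notification, mimicking the effect of
    grouping on the `type` label (simplified: a missing label is
    treated as an empty string).
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


groups = group_alerts(alerts)
for key, members in groups.items():
    # One line per group, i.e. one notification per service type.
    print(key, [a["alertname"] for a in members])
```

In this sketch the two web alerts end up in a single group, so they would produce one notification instead of two.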

Update template for Slack notification

Update the alert template for the Slack notification to specify which service is firing and to list the alerts that fired. https://prometheus.io/docs/alerting/latest/notifications/ is a good reference for Alertmanager templating.

Before
Slack: Screenshot_2022-06-09_at_10.43.52
PagerDuty: Screenshot_2022-06-13_at_09.10.40

After
Slack: image
PagerDuty: Screenshot_2022-06-13_at_09.06.59
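
As a rough illustration of the grouped notification, the Python sketch below builds a message that names the service and lists the alerts in the group. The real change uses Alertmanager's Go templates; the message layout, the render_grouped_message helper, and the sample alerts here are assumptions and do not reproduce the template in this merge request.

```python
def render_grouped_message(service, alerts):
    """Build a plain-text message for one alert group.

    Hypothetical layout: a title naming the service and the number of
    firing alerts, followed by one line per distinct alert name.
    """
    names = sorted({alert["alertname"] for alert in alerts})
    title = f"{service} service: {len(alerts)} alert(s) firing"
    lines = [f"- {name}" for name in names]
    return "\n".join([title, *lines])


# Made-up alerts for a single `type` group.
sample_group = [
    {"alertname": "ErrorSLOViolation", "type": "web"},
    {"alertname": "ApdexSLOViolation", "type": "web"},
]
message = render_grouped_message("web", sample_group)
print(message)
```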

Update silence button

Update the silence button to silence all the firing alerts for that type instead of a single alert, for example:

Screenshot_2022-06-10_at_10.03.05
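
One way such a link could be built is sketched below in Python, assuming the button opens the Alertmanager "new silence" page with a matcher on the type label passed in the filter query parameter. The silence_url helper and the example host are hypothetical, and the exact matchers used in this merge request may differ.

```python
from urllib.parse import quote


def silence_url(external_url, type_label):
    """Return a link that pre-fills a silence matching every alert of one type.

    Assumes the Alertmanager UI route `#/silences/new` with a `filter`
    parameter holding a URL-encoded matcher such as {type="web"}.
    """
    matcher = '{type="%s"}' % type_label
    base = external_url.rstrip("/")
    return f"{base}/#/silences/new?filter={quote(matcher, safe='')}"


# Hypothetical Alertmanager address.
print(silence_url("https://alertmanager.example.com", "web"))
```

Matching on the type label rather than on a single alert name is what lets one silence cover every alert in the group.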

Why

As seen in gitlab-com/gl-infra&746 (closed), multiple alerts for the same service can page the SRE on-call within a few seconds. Multiple pages are stressful and distracting, and they make it unclear which one to act on. Instead, send only one page per type.

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15765

Testing

Edited by Steve Xuereb
