feat: group alerts by service/type
What
Group alerts by service
Group alerts by the type label using Alertmanager grouping so that only one alert is sent per service.
alertmanager-grouping.excalidraw
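A minimal sketch of what this grouping could look like in the Alertmanager routing tree, assuming the label is literally named type. The receiver name and timing values are illustrative placeholders, not the values used in this MR.

```yaml
# Sketch only: receiver name and timings are assumptions, not this MR's values.
route:
  receiver: sre-oncall        # hypothetical receiver name
  group_by: ['type']          # one notification group per value of the type label
  group_wait: 30s             # wait for related alerts before the first notification
  group_interval: 5m          # wait before notifying about new alerts in an existing group
  repeat_interval: 4h         # wait before re-sending a notification that is still firing
```

With group_by set this way, alerts that share a type value and fire within group_wait are delivered together as one notification instead of one page each.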
Update template for Slack notification
Update the Slack notification template to show which service is firing and to list the alerts that fired. https://prometheus.io/docs/alerting/latest/notifications/ is a good reference for Alertmanager templating.
Before | After |
---|---|
Slack PagerDuty |
Slack PagerDuty |
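The template change described above could be sketched roughly as follows, using inline templates in a Slack receiver. The receiver name, channel, and the summary annotation are assumptions for illustration; the actual templates in this MR may be structured differently.

```yaml
# Sketch only: receiver name, channel, and the summary annotation are hypothetical.
receivers:
  - name: sre-oncall
    slack_configs:
      - channel: '#alerts'
        # Name the service (the value of the grouping label) in the title.
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.type }}: {{ .Alerts.Firing | len }} alert(s) firing'
        # List every firing alert in the group, one per line.
        text: |-
          {{ range .Alerts.Firing }}- {{ .Labels.alertname }}: {{ .Annotations.summary }}
          {{ end }}
```

Because the route groups by type, .GroupLabels.type is set on every notification, and .Alerts.Firing holds all firing alerts that were folded into that group.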
Update silence button
Update the silence button to silence all the firing alerts for that type
instead of a single alert.
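One way such a button could be built is sketched below, assuming the Alertmanager web UI pre-fills the new-silence form from a filter query parameter (the approach widely used in community templates). The button text is illustrative, and this is a fragment of a slack_configs entry rather than the MR's actual template.

```yaml
# Sketch only: fragment of a slack_configs entry; button text is illustrative.
actions:
  - type: button
    text: 'Silence'
    # Pre-fill the new-silence form with a matcher on the shared type label,
    # so the silence covers every alert in the group rather than one alertname.
    # %7B, %7D, %3D and %22 are the URL-encoded forms of {, }, = and ".
    url: '{{ .ExternalURL }}/#/silences/new?filter=%7Btype%3D%22{{ .GroupLabels.type | urlquery }}%22%7D'
```

Matching on the grouping label alone is what makes the silence apply to the whole group; adding an alertname matcher would narrow it back to a single alert.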
Why
As seen in
gitlab-com/gl-infra&746 (closed),
multiple alerts for the same service page the SRE on-call within a few
seconds. Multiple pages are stressful and distracting, and they make it
unclear which one to act on, so send only one page per type instead.
Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15765
Testing
- Check that the Alertmanager configuration was updated.
- Check that the templates were updated correctly.
- Check that the Slack message is correct.
- Check that the PagerDuty message is correct.
- Check that alerts in Elasticsearch are correct.