feat: group alerts by service/type
This attempt failed; the second attempt is in !4715 (merged).
What
Group alerts by service
Group alerts by the type label using Alertmanager grouping, so that only one alert is sent per service.
alertmanager-grouping.excalidraw
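As a rough illustration of the grouping described above, an Alertmanager route could look something like the sketch below. The receiver name and the timing values are assumptions made for the example and are not taken from this change; the real route tree likely has more labels and child routes.

```yaml
# Sketch only: receiver name and timings are hypothetical.
route:
  receiver: slack-alerts   # hypothetical receiver name
  group_by: ['type']       # one notification group per service (the type label)
  group_wait: 30s          # initial wait, to collect alerts that fire together
  group_interval: 5m       # wait before notifying about new alerts added to a group
  repeat_interval: 4h      # wait before re-sending a notification that was already sent
```

With `group_by: ['type']`, alerts that share the same type value are batched into a single notification instead of producing one page each.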
Update template for Slack notification
Update the alert template for the Slack notification to specify which service is firing and to list the alerts that fired. https://prometheus.io/docs/alerting/latest/notifications/ is a good reference for Alertmanager templating.
Before | After |
---|---|
Slack PagerDuty |
Slack PagerDuty |
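A minimal sketch of what such a Slack template might look like, assuming alerts are grouped by the type label. The receiver name, channel, and message wording are hypothetical; only the template fields (`.GroupLabels`, `.Alerts.Firing`, `.Labels`) come from the Alertmanager notification template data.

```yaml
# Sketch only: receiver name, channel, and wording are hypothetical.
receivers:
  - name: slack-alerts
    slack_configs:
      - channel: '#alerts'
        # Name the service (group label) and how many alerts are firing.
        title: '{{ .GroupLabels.type }}: {{ .Alerts.Firing | len }} alert(s) firing'
        # List every firing alert in the group by name.
        text: |-
          {{ range .Alerts.Firing -}}
          • {{ .Labels.alertname }}
          {{ end }}
```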
Update silence button
Update the silence button to silence all the firing alerts for that type instead of a single alert.
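One way this could be expressed is a Slack action whose URL pre-fills a new silence with the group labels as matchers, so the silence covers every alert in the group. This is a sketch under assumptions: the receiver name, channel, and button text are hypothetical, and the actual template in this change may build the link differently.

```yaml
# Sketch only: names are hypothetical; the URL opens the Alertmanager
# "new silence" form with one matcher per group label, e.g. {type="web"}.
receivers:
  - name: slack-alerts
    slack_configs:
      - channel: '#alerts'
        actions:
          - type: button
            text: 'Silence'
            url: >-
              {{ .ExternalURL }}/#/silences/new?filter=%7B{{ range $i, $l := .GroupLabels.SortedPairs }}{{ if $i }}%2C{{ end }}{{ $l.Name }}%3D%22{{ $l.Value }}%22{{ end }}%7D
```

Because the matchers come from `.GroupLabels` rather than from a single alert's labels, the resulting silence applies to all alerts grouped under that type.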
Why
As seen in gitlab-com/gl-infra&746 (closed), multiple alerts page the SRE on-call within a few seconds for the same service. Multiple pages are stressful and distracting, and they make it unclear which one to act on, so send only one page per type.
Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15765
Testing
- Check that the Alertmanager configuration was updated.
- Check that the templates were updated correctly.
- Check that the Slack message is correct.
- Check that the PagerDuty message is correct.
- Check that alerts in Elasticsearch are correct 👉 https://nonprod-log.gitlab.net/goto/2c6ee400-e8cc-11ec-b771-57a829f2c394