Spike: deduplicating alerts for a service
Summary
Each service has multiple SLIs for each component (for example, web-pages), and these SLIs are needed. However, when one of the SLOs triggers an alert, all SLOs for that SLI end up firing within a span of 1 minute, which means that the SRE on call gets paged 3-4 times about the same service because of the same issue.
For example:
In a matter of 10 minutes, 1 person got paged 12 times about the same incident, which is very distracting and prevents them from focusing on the real issue at hand.
End Goal:
- The SRE on-call gets only 1 page per service; until that page/incident is resolved, they will not get paged about the same service again.
- When the `main` stage is violating its SLO, they shouldn't get pages that the `cny` stage is also violating its SLO; that is assumed.
Possible solutions
Alertmanager Grouping (Easiest)
Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
Example: Dozens or hundreds of instances of a service are running in your cluster when a network partition occurs. Half of your service instances can no longer reach the database. Alerting rules in Prometheus were configured to send an alert for each service instance if it cannot communicate with the database. As a result hundreds of alerts are sent to Alertmanager.
As a user, one only wants to get a single page while still being able to see exactly which service instances were affected. Thus one can configure Alertmanager to group alerts by their cluster and alertname so it sends a single compact notification.
Grouping of alerts, timing for the grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.
Diagram: example-of-alertmanager-grouping.excalidraw, comparing pages before (no grouping) and after (grouping).
How we can use it
- Group everything by service `type`, `env`, and `environment`. With this we can group all alerts that fire around the same time per service (sketched below).
  - groups regional alerts as well.
  - groups `cny` as well.
This is the easiest change we can do, and it will reduce the pager storm the EOC gets when 1 service is fully down.
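A minimal sketch of what this could look like in the Alertmanager routing tree, assuming the SLO alerts carry `type`, `env`, and `environment` labels; the `slo_pagerduty` receiver name and integration key are placeholders:

```yaml
route:
  receiver: slo_pagerduty            # placeholder receiver name
  # One notification per service/environment instead of one page per SLO alert.
  group_by: ['type', 'env', 'environment']
  group_wait: 30s       # wait briefly so alerts firing around the same time land in one page
  group_interval: 5m    # batch up new alerts that join an existing group
  repeat_interval: 4h   # only re-page if the group is still firing after this long

receivers:
  - name: slo_pagerduty
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'   # placeholder
```

Because a `stage` label (assuming the alerts carry one) is not part of `group_by`, alerts from `main` and `cny` for the same service and environment would fall into the same group, and `repeat_interval` controls how long before the on-call is paged again about the same unresolved group.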
Missing features
- Service dependencies (will be covered with inhibition).
- Grouping of black-box alerts, since their labels are different, but we can update this.
Research
- Videos
  - https://www.youtube.com/watch?v=jpb6fLQOgn4
  - https://www.youtube.com/watch?v=PUdjca23Qa4 👈 If you only have time to watch 1 video, watch this.
- Articles
- Resources
Inhibition (Medium)
Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.
Example: An alert is firing that informs that an entire cluster is not reachable. Alertmanager can be configured to mute all other alerts concerning this cluster if that particular alert is firing. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.
Inhibitions are configured through the Alertmanager's configuration file.
How we can use it
@reprazent provided a perfect example in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15721#note_947490158 where `rails` wouldn't alert if there are `patroni` alerts firing.
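A minimal sketch of such an inhibit rule, assuming a recent Alertmanager that supports `source_matchers`/`target_matchers` and that the alerts carry a `type` label identifying the service (the label names and values here are illustrative, not our actual rules):

```yaml
inhibit_rules:
  # While any patroni alert is firing, suppress rails alerts in the same
  # environment/stage so the on-call is only paged about the root cause.
  - source_matchers:
      - type="patroni"
    target_matchers:
      - type="rails"
    # Only inhibit when source and target agree on these labels,
    # so an incident in one environment doesn't silence another.
    equal: ['env', 'environment', 'stage']
```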
Research
PagerDuty (Hardest)
What we would need to do to change things
- Update the PagerDuty Service Directory to be in sync with our service directory in the runbook, as currently we can't get grouping. We also don't have the right license.
- Some services get grouped in the wrong way automatically (see Screenshot_2022-05-18_at_13.54.15); we can fix this.
- We need to buy more into PagerDuty, both license-wise and company-wise.
- We manage PagerDuty by hand and would need to use the Terraform provider to manage it better, which is a higher barrier to entry (because of the extra work involved).
Research
- Videos
- Articles
  - Intelligent Alert Grouping
  - Content-Based Alert Grouping. Similar to Alertmanager grouping and inhibit rules.
  - Time-Based Alert Grouping. Might break down really quickly if 2 unrelated services fail at the same time, but at least it gives you a nice data point.
North stars
- Alerts per day
- Pages
Results
Created a new scoped epic &746 (closed)