Duplicate issues created for firing Prometheus alerts with multiple failing series
Summary
Multiple issues are created for a single notification that contains multiple alerts from a GitLab-managed Prometheus instance. This may occur when a metric yields multiple series on a single panel.
Steps to reproduce
Contrived, easy repro option
- Go to Settings > Operations > Incidents
- Enable auto-creation of incident issues
- Go to Settings > Operations > Alerts
- Activate External Prometheus
- Trigger an alert using this test payload:
{
  "version": "4",
  "groupKey": null,
  "status": "firing",
  "receiver": "",
  "groupLabels": {},
  "commonLabels": {},
  "commonAnnotations": {},
  "externalURL": "",
  "alerts": [
    {
      "startsAt": "2020-08-18T14:58:44Z",
      "generatorURL": "http://host?g0.expr=up",
      "endsAt": null,
      "status": "firing",
      "labels": {},
      "annotations": {
        "title": "Test triplicate issues"
      }
    },
    {
      "startsAt": "2020-08-18T14:58:44Z",
      "generatorURL": "http://host?g0.expr=up",
      "endsAt": null,
      "status": "firing",
      "labels": {},
      "annotations": {
        "title": "Test triplicate issues"
      }
    },
    {
      "startsAt": "2020-08-18T14:58:44Z",
      "generatorURL": "http://host?g0.expr=up",
      "endsAt": null,
      "status": "firing",
      "labels": {},
      "annotations": {
        "title": "Test triplicate issues"
      }
    }
  ]
}
- This payload contains 3 alerts which have matching fingerprints, so they are considered one AlertManagement::Alert in GitLab (rightfully?)
- Click the new alert titled "Test triplicate issues" under Operations > Alerts
- 3 issues will be created for the AlertManagement::Alert
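To illustrate why the three payload entries collapse into one alert, here is a minimal Python sketch. It assumes the fingerprint is derived from startsAt, the title annotation, and generatorURL. That choice of fields and the `fingerprint` helper are hypothetical and are not taken from GitLab's implementation.

```python
import hashlib
import json


def fingerprint(alert):
    """Hypothetical fingerprint: hash of the fields assumed to identify an alert."""
    key = [
        alert.get("startsAt"),
        alert.get("annotations", {}).get("title"),
        alert.get("generatorURL"),
    ]
    return hashlib.sha1(json.dumps(key).encode("utf-8")).hexdigest()


# One entry from the test payload, repeated three times as in the repro.
entry = {
    "startsAt": "2020-08-18T14:58:44Z",
    "generatorURL": "http://host?g0.expr=up",
    "endsAt": None,
    "status": "firing",
    "labels": {},
    "annotations": {"title": "Test triplicate issues"},
}
alerts = [dict(entry) for _ in range(3)]

fingerprints = [fingerprint(a) for a in alerts]
unique_fingerprints = set(fingerprints)

print(len(alerts), "payload entries,", len(unique_fingerprints), "distinct fingerprint(s)")
```

Under these assumptions, all three entries map to the same fingerprint, which matches the observed behavior of a single alert appearing under Operations > Alerts.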
Slower, more-likely-to-occur in the wild repro option
- Create a Prometheus metric which returns multiple series (e.g. the query "up" will yield multiple series on a single panel)
- Set an alert rule which will fire for more than one of the series
  - The metric below has an alert rule of "up < 1", which was firing for two of the jobs
- Under Settings > Operations > Incidents, turn on auto-issue creation for alerts
- Wait 5 min for the alert to fire
- Visit the firing alert detail view under Operations > Alerts > click the alert
- See the multiple issues for the alert
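The steps above can be pictured with a small Python simulation of how one alert rule over a multi-series metric produces several alerts in a single notification. The job names, sample values, and the `firing_series` and `build_notification` helpers are hypothetical; only the rule `up < 1` and the overall payload shape come from this report.

```python
# Hypothetical samples for the `up` metric: one series per scrape job.
samples = {
    "job-a": 0,
    "job-b": 0,
    "job-c": 1,
}


def firing_series(samples, threshold=1):
    """Return the jobs whose series value satisfies `up < threshold`."""
    return sorted(job for job, value in samples.items() if value < threshold)


def build_notification(samples):
    """Build a notification shaped like the test payload: one alert per firing series."""
    return {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"job": job},  # assumed: series labels carried on the alert
                "annotations": {"title": "up < 1"},  # placeholder title
            }
            for job in firing_series(samples)
        ],
    }


notification = build_notification(samples)
print(len(notification["alerts"]), "alerts in one notification")
```

With two of the three jobs down, the single notification carries two alerts, which is the situation in which the duplicate issues were observed.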
[Screenshots: firing metric alert | duplicate issues]
Example Project
- Alert: https://gitlab.com/gitlab-org/monitor/tanuki-inc/-/alert_management/248/details#/overview
- Metric: https://gitlab.com/gitlab-org/monitor/tanuki-inc/-/metrics?dashboard=config%2Fprometheus%2Fcommon_metrics.yml&group=Business%20metrics%20(Custom)&title=Up%20-%20Testing%20alert%20metrics&y_label=up
What is the current bug behavior?
Multiple issues can be created on alert notification for a single AlertManagement::Alert
What is the expected correct behavior?
A single issue should be created for a single AlertManagement::Alert
Possible fixes
Probably somewhere in app/services/projects/prometheus/alerts/notify_service.rb & app/services/alert_management/process_prometheus_alert_service.rb
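As a rough illustration of the expected behavior, the Python sketch below creates at most one issue per alert fingerprint while processing a notification. It is not based on the code in the services named above. `IssueTracker`, `create_issue_once`, `process_notification`, and the tuple-based fingerprint are all hypothetical names chosen for the example.

```python
class IssueTracker:
    """In-memory stand-in for issue creation (hypothetical, for illustration only)."""

    def __init__(self):
        self.issues = []
        self._issue_by_fingerprint = {}

    def create_issue_once(self, fingerprint, title):
        """Return the existing issue for this fingerprint, or create one if none exists."""
        existing = self._issue_by_fingerprint.get(fingerprint)
        if existing is not None:
            return existing
        issue = {"id": len(self.issues) + 1, "title": title}
        self.issues.append(issue)
        self._issue_by_fingerprint[fingerprint] = issue
        return issue


def simple_fingerprint(alert):
    """Assumed fingerprint fields: start time, title, and generator URL."""
    return (
        alert.get("startsAt"),
        alert.get("annotations", {}).get("title"),
        alert.get("generatorURL"),
    )


def process_notification(tracker, alerts):
    """Process every alert in a notification, creating one issue per distinct fingerprint."""
    return [
        tracker.create_issue_once(simple_fingerprint(a), a["annotations"]["title"])
        for a in alerts
    ]


entry = {
    "startsAt": "2020-08-18T14:58:44Z",
    "generatorURL": "http://host?g0.expr=up",
    "annotations": {"title": "Test triplicate issues"},
}
tracker = IssueTracker()
results = process_notification(tracker, [dict(entry) for _ in range(3)])
print(len(tracker.issues), "issue(s) created")
```

An in-memory dictionary is enough to show the invariant here. A real fix would presumably need the check and the creation to be atomic (for example via a lock or a uniqueness constraint), since alerts from one notification may be processed concurrently.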