De-duplication of Prometheus alerts for Incidents
Problem to solve
An alert can often fire multiple times over the course of a single incident. Prometheus already supports a good deal of de-duplication and grouping, which helps. However, the same alert can still resolve and then trigger again while we already have an open issue for it.
When automatic issue creation is enabled, we should detect this scenario to reduce noise and issue clutter.
Intended users
Sasha the Software Developer
Devon the DevOps Engineer
Sidney the Systems Administrator
Further details
This work contributes to the Incident Management Vision
Proposal
Prometheus sends a groupKey which is a unique identifier for each alert group.
Before opening a new issue for a given alert, we should first check to see if an existing issue is already open for a given groupKey. If one is already open, we should add a comment that the alert triggered again rather than creating a new issue.
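The check described above can be sketched as follows. This is only a minimal in-memory illustration, not the actual implementation: `Issue`, `IncidentTracker`, `handle_alert`, and the comment text are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical stand-in for an incident issue."""
    iid: int
    title: str
    state: str = "opened"
    comments: list = field(default_factory=list)


class IncidentTracker:
    """Hypothetical in-memory stand-in for the issue tracker."""

    def __init__(self):
        self.issues = []
        # Maps a groupKey to the most recent issue opened for it.
        self.latest_by_group_key = {}

    def handle_alert(self, group_key, title):
        """Comment on the open issue for group_key, or open a new one."""
        existing = self.latest_by_group_key.get(group_key)
        if existing is not None and existing.state == "opened":
            # An issue is already open: record the re-trigger as a comment.
            existing.comments.append("Alert triggered again.")
            return existing
        # No open issue for this groupKey: create one and remember it.
        issue = Issue(iid=len(self.issues) + 1, title=title)
        self.issues.append(issue)
        self.latest_by_group_key[group_key] = issue
        return issue

    def close(self, issue):
        issue.state = "closed"
```

In this sketch, a second alert with the same groupKey adds a comment to the existing issue, while an alert arriving after that issue has been closed opens a fresh one.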
The main decision here, I think, will be how we link a groupKey to a specific incident. We have a few options:
- Include the groupKey in the issue body or title, where we can search for it
- Build an extra table in the database, linking the issues that have been opened for a given groupKey
Since this is largely an internal detail, the latter option seems to make the most sense. However, it does mean we will need to add a new database table to manage these links.
Note: we should ensure we can track not just the most recent issue, but also the history. This way we can determine which issues have been opened for this alert in the past, and present them as possible hints to the operator trying to resolve the situation.
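One possible shape for such a table, shown with SQLite purely for illustration. The table name `alert_group_issues`, its columns, and the helper functions are assumptions made for this sketch; a real schema would more likely reference the existing issues table and read the issue state from there rather than duplicating it.

```python
import sqlite3


def create_schema(conn):
    # One row per issue ever opened for a groupKey, so history is retained.
    conn.execute(
        """
        CREATE TABLE alert_group_issues (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            group_key TEXT NOT NULL,
            issue_id INTEGER NOT NULL,
            state TEXT NOT NULL DEFAULT 'opened'
        )
        """
    )
    conn.execute(
        "CREATE INDEX idx_alert_group_issues_key ON alert_group_issues (group_key)"
    )


def link_issue(conn, group_key, issue_id):
    """Record that issue_id was opened for group_key."""
    conn.execute(
        "INSERT INTO alert_group_issues (group_key, issue_id) VALUES (?, ?)",
        (group_key, issue_id),
    )


def mark_closed(conn, issue_id):
    """Mark the link row as closed (state is duplicated here for simplicity)."""
    conn.execute(
        "UPDATE alert_group_issues SET state = 'closed' WHERE issue_id = ?",
        (issue_id,),
    )


def open_issue_for(conn, group_key):
    """Return the most recent open issue id for group_key, or None."""
    row = conn.execute(
        "SELECT issue_id FROM alert_group_issues "
        "WHERE group_key = ? AND state = 'opened' "
        "ORDER BY id DESC LIMIT 1",
        (group_key,),
    ).fetchone()
    return row[0] if row else None


def issue_history(conn, group_key):
    """Return every issue id ever opened for group_key, oldest first."""
    rows = conn.execute(
        "SELECT issue_id FROM alert_group_issues WHERE group_key = ? ORDER BY id",
        (group_key,),
    ).fetchall()
    return [r[0] for r in rows]
```

Keeping one row per issue, rather than overwriting a single row per groupKey, is what lets `issue_history` surface past issues as hints to the operator.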
Permissions and Security
Documentation
Testing
What does success look like, and how can we measure that?
Links / references
/label ~feature