De-duplication of Prometheus alerts for Incidents
Problem to solve
An alert can often fire multiple times over the course of a single incident. Prometheus supports a good deal of de-duplication and grouping, which helps. However, the same alert can resolve and then trigger again while we already have an issue open for it.
When automatic issue creation is enabled, we should detect this scenario to reduce noise and issue clutter.
Intended users
Sasha the Software Developer
Devon the DevOps Engineer
Sidney the Systems Administrator
Further details
This work contributes to the Incident Management Vision
Original Proposal
Prometheus sends a groupKey, which is a unique identifier for each alert group.
Before opening a new issue for a given alert, we should first check to see if an existing issue is already open for a given groupKey. If one is already open, we should add a comment that the alert triggered again rather than creating a new issue.
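The check described above can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: `Issue`, `IncidentTracker`, `find_open_issue`, and `handle_alert` are hypothetical names, and the dictionary stands in for whatever storage ends up linking a groupKey to an issue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Issue:
    id: int
    title: str
    state: str = "opened"
    comments: List[str] = field(default_factory=list)


class IncidentTracker:
    """Hypothetical in-memory stand-in for the issue tracker."""

    def __init__(self) -> None:
        self.issues: List[Issue] = []
        # groupKey -> id of the most recent issue created for that alert group
        self.issue_by_group_key: Dict[str, int] = {}

    def find_open_issue(self, group_key: str) -> Optional[Issue]:
        issue_id = self.issue_by_group_key.get(group_key)
        if issue_id is None:
            return None
        issue = self.issues[issue_id - 1]  # ids are 1-based list positions
        return issue if issue.state == "opened" else None

    def handle_alert(self, group_key: str, title: str) -> Issue:
        existing = self.find_open_issue(group_key)
        if existing is not None:
            # An issue is already open: comment instead of creating a new one.
            existing.comments.append("Alert triggered again.")
            return existing
        issue = Issue(id=len(self.issues) + 1, title=title)
        self.issues.append(issue)
        self.issue_by_group_key[group_key] = issue.id
        return issue
```

Once the linked issue is closed, the next alert for the same groupKey opens a fresh issue, since only open issues count as a match.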
The main decision here, I think, will be how we link a groupKey to a specific incident. We have a few options:
- Include the groupKey in the issue body or title, where we can search for it
- Build an extra table in the database, linking the issues that have been opened for a given groupKey
Since this is largely an internal detail, the latter option seems to make the most sense. However, it does mean we will need to add a new database table to manage these links.
Note: we should ensure we can track not just the most recent issue, but also the history. This way we can determine which issues have been opened for this alert in the past, and present them as possible hints to the operator trying to resolve the situation.
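One possible shape for such a table is sketched below, using an in-memory SQLite database purely for illustration. The table name `alert_issue_links`, its columns, and the helper functions are assumptions, not an agreed schema. Keeping one row per issue, rather than one row per groupKey, is what preserves the history.

```python
import sqlite3
from typing import List, Optional

# Hypothetical schema: one row per issue ever opened for a groupKey.
SCHEMA = """
CREATE TABLE alert_issue_links (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    group_key TEXT    NOT NULL,
    issue_id  INTEGER NOT NULL,
    is_open   INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX idx_links_group_key ON alert_issue_links (group_key);
"""


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def link_issue(conn: sqlite3.Connection, group_key: str, issue_id: int) -> None:
    conn.execute(
        "INSERT INTO alert_issue_links (group_key, issue_id) VALUES (?, ?)",
        (group_key, issue_id),
    )


def close_issue(conn: sqlite3.Connection, issue_id: int) -> None:
    conn.execute(
        "UPDATE alert_issue_links SET is_open = 0 WHERE issue_id = ?",
        (issue_id,),
    )


def open_issue_for(conn: sqlite3.Connection, group_key: str) -> Optional[int]:
    """Return the id of the currently open issue for this groupKey, if any."""
    row = conn.execute(
        "SELECT issue_id FROM alert_issue_links "
        "WHERE group_key = ? AND is_open = 1 ORDER BY id DESC LIMIT 1",
        (group_key,),
    ).fetchone()
    return row[0] if row else None


def issue_history(conn: sqlite3.Connection, group_key: str) -> List[int]:
    """Return every issue ever opened for this groupKey, oldest first."""
    rows = conn.execute(
        "SELECT issue_id FROM alert_issue_links WHERE group_key = ? ORDER BY id",
        (group_key,),
    ).fetchall()
    return [r[0] for r in rows]
```

`open_issue_for` answers the de-duplication question, while `issue_history` returns the past issues that could be shown to the operator as hints.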
WIP Proposal
Before opening a new issue for a given alert, we should first check to see if an existing issue is already open for a given groupKey. If one is already open, we will add the new alert to an alert counter.
Additional details
- Basic idea: simulate an alert counter in the issue comments
- When the issue is created from a Prometheus alert, the alert bot would immediately post a comment saying, "Alert counter: 1"
- This reply would be updated when subsequent alerts come in (i.e., the counter would increase: 1, 2, 3, 4, etc.)
- Editing a comment would not generate a system note (or an email notification) so users won't be bombarded with tons of additional notifications
- Counter would be stored in the database (so, if a comment is accidentally deleted, users wouldn't lose the alert count)
- The counter would be the first comment in the issue so that should make it easier to know which comment needs editing
- As per normal comment functionality, the timestamp would update after every edit, so users would have a sense of when the last alert was recorded.
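The counter behaviour outlined in the list above might look roughly like this. It is a sketch under stated assumptions: `AlertCounter`, `CounterComment`, and `record_alert` are hypothetical names, and the two dictionaries stand in for the database and the issue's comments respectively.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CounterComment:
    body: str
    edits: int = 0  # number of in-place edits made to this comment


class AlertCounter:
    """Hypothetical sketch: the count lives in storage, the comment is a view."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}  # issue_id -> count (database stand-in)
        self.comments: Dict[int, CounterComment] = {}  # issue_id -> counter comment

    @staticmethod
    def render(count: int) -> str:
        return f"Alert counter: {count}"

    def record_alert(self, issue_id: int) -> CounterComment:
        # The stored count is the source of truth, not the comment text.
        count = self.counts.get(issue_id, 0) + 1
        self.counts[issue_id] = count

        comment = self.comments.get(issue_id)
        if comment is None:
            # First alert, or the comment was deleted: post it from the stored count.
            comment = CounterComment(body=self.render(count))
            self.comments[issue_id] = comment
        else:
            # Edit in place rather than posting a new comment.
            comment.body = self.render(count)
            comment.edits += 1
        return comment
```

Because the comment body is always rendered from the stored count, an accidentally deleted comment is recreated with the correct total on the next alert.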
As part of this work, we also discussed the following:
- Add the groupKey and the generatorURL to the issue description, along with the rest of the alert details. This will help users go back to the Prometheus web UI to investigate subsequent alerts as needed.
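As a rough illustration, the snippet below pulls both values out of an Alertmanager-style webhook payload and renders them as description lines. The function name `render_issue_description` and the output layout are assumptions. The payload fields used (`groupKey`, `alerts`, `labels`, `generatorURL`) follow the Alertmanager webhook format.

```python
from typing import Any, Dict, List


def render_issue_description(payload: Dict[str, Any]) -> str:
    """Hypothetical helper: build description lines from a webhook payload."""
    lines: List[str] = [f"groupKey: {payload['groupKey']}"]
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "(unnamed alert)")
        url = alert.get("generatorURL", "(no generatorURL)")
        lines.append(f"- {name}: {url}")
    return "\n".join(lines)


# Illustrative payload; the values are made up for this example.
example_payload = {
    "groupKey": '{}:{alertname="HighCPU"}',
    "alerts": [
        {
            "labels": {"alertname": "HighCPU"},
            "generatorURL": "http://prometheus.example.com/graph?g0.expr=up",
        }
    ],
}
description = render_issue_description(example_payload)
```

Note that generatorURL is reported per alert, while groupKey applies to the whole group, so a grouped notification can yield several links under one key.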
Permissions and Security
Documentation
Testing
What does success look like, and how can we measure that?
Links / references
/label gitlab-ce~10230929