De-duplication of Prometheus alerts for Incidents

Problem to solve

An alert can often fire multiple times over the course of a single incident. Prometheus supports a good deal of de-duplication and grouping, which helps. However, the same alert can still resolve and then trigger again while we already have an open issue for it.

When automatic issue creation is enabled, we should detect this scenario to reduce noise and issue clutter.

Intended users

Sasha the Software Developer
Devon the DevOps Engineer
Sidney the Systems Administrator

Further details

This work contributes to the Incident Management Vision

Proposal

Prometheus Alertmanager sends a groupKey with each notification, which is a unique identifier for the alert group.

Before opening a new issue for a given alert, we should first check whether an issue is already open for that alert's groupKey. If one is, we should add a comment noting that the alert triggered again rather than creating a new issue.
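
As a rough illustration of the check described above, here is a minimal in-memory sketch. The `Issue` and `IssueTracker` names, the comment text, and the issue title format are all hypothetical and not part of any existing codebase. The only fields taken from the notification are `groupKey` and `status`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Issue:
    """Hypothetical stand-in for an incident issue."""
    id: int
    title: str
    open: bool = True
    comments: List[str] = field(default_factory=list)


class IssueTracker:
    """Hypothetical in-memory store mapping a groupKey to its latest issue."""

    def __init__(self) -> None:
        self.issues: List[Issue] = []
        self.latest_by_group_key: Dict[str, Issue] = {}

    def handle_alert(self, payload: dict) -> Optional[Issue]:
        # Only firing notifications should open or update an issue.
        if payload.get("status") != "firing":
            return None

        group_key = payload["groupKey"]
        existing = self.latest_by_group_key.get(group_key)

        # An issue is already open for this groupKey: comment instead of duplicating.
        if existing is not None and existing.open:
            existing.comments.append("Alert triggered again.")
            return existing

        # No open issue for this groupKey: create a new one.
        issue = Issue(id=len(self.issues) + 1, title=f"Alert: {group_key}")
        self.issues.append(issue)
        self.latest_by_group_key[group_key] = issue
        return issue
```

With this sketch, two firing notifications carrying the same groupKey yield one issue with one comment, and a new issue is created only after the earlier one has been closed.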

The main decision here, I think, will be how we link a groupKey to a specific incident. We have two options:

  • Include the groupKey in the issue body or title, where we can search for it
  • Build an extra table in the database, linking the issues that have been opened for a given groupKey

Since this is largely an internal detail, the latter option seems to make the most sense. However, it does mean we will need to add a new database table to manage these links.

Note: we should ensure we can track not just the most recent issue, but also the history. This way we can determine which issues have been opened for this alert in the past, and present them as possible hints to the operator trying to resolve the situation.
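
To make the table option more concrete, here is a minimal sketch using SQLite from the Python standard library. The table name `alert_group_issues`, its columns, and the helper functions are assumptions for illustration only. In particular, the `closed` flag is a simplification, since a real implementation would presumably read the issue's state from the issues table itself. Each row links one groupKey to one issue, so the full history is preserved rather than only the most recent issue.

```python
import sqlite3
from typing import List, Optional

# Hypothetical schema: one row per issue ever opened for a groupKey.
SCHEMA = """
CREATE TABLE IF NOT EXISTS alert_group_issues (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    group_key TEXT NOT NULL,
    issue_id INTEGER NOT NULL,
    closed INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_alert_group_issues_group_key
    ON alert_group_issues (group_key);
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_issue(conn: sqlite3.Connection, group_key: str, issue_id: int) -> None:
    """Link a newly opened issue to the groupKey that triggered it."""
    conn.execute(
        "INSERT INTO alert_group_issues (group_key, issue_id) VALUES (?, ?)",
        (group_key, issue_id),
    )
    conn.commit()


def mark_closed(conn: sqlite3.Connection, issue_id: int) -> None:
    conn.execute(
        "UPDATE alert_group_issues SET closed = 1 WHERE issue_id = ?",
        (issue_id,),
    )
    conn.commit()


def open_issue_for(conn: sqlite3.Connection, group_key: str) -> Optional[int]:
    """Return the most recent still-open issue for this groupKey, if any."""
    row = conn.execute(
        "SELECT issue_id FROM alert_group_issues "
        "WHERE group_key = ? AND closed = 0 ORDER BY id DESC LIMIT 1",
        (group_key,),
    ).fetchone()
    return row[0] if row else None


def issue_history(conn: sqlite3.Connection, group_key: str) -> List[int]:
    """Return every issue ever opened for this groupKey, oldest first."""
    rows = conn.execute(
        "SELECT issue_id FROM alert_group_issues WHERE group_key = ? ORDER BY id",
        (group_key,),
    ).fetchall()
    return [r[0] for r in rows]
```

Here `open_issue_for` answers the "is an issue already open?" question, while `issue_history` returns the past issues that could be shown to the operator as hints.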

Permissions and Security

Documentation

Testing

What does success look like, and how can we measure that?

Links / references

/label ~feature

Edited Aug 22, 2019 by Sarah Waldner