Alerting Improvements

As part of the MVC of our SLO alerts issue, feedback was passed along on further improvements we could make. I wanted to collect that here.

Feedback

I have some suggestions from the point of view of a potential user. Most of my ideas are about the workflow for handling alerts.

My utopia of how this would work is that when an alert is sent, it should have a "create issue" and an "add to existing issue" link. The former would link to a page which creates an issue from a template (presumably with a label like "outage"). All the alerts that fired and were attached this way to that outage incident would then be listed on the issue with their current state (and maybe a sparkline graph of their history or something). Attaching an alert would automatically create silences for the relevant alerts (which would require a bit of UI to adjust the matching labels) until the issue is closed. When it's closed, the silences would automatically be removed.

This lets issues drive the workflow and treats alerts as just a notification system rather than a parallel issue list. It also means that once an incident is resolved, you're left with an issue which includes data on all the alert notifications sent, along with the work done on the issue. And it ensures that silences are managed through the workflow, with issues explaining why they exist and what needs to happen to remove them. It's really important to realize that an incident will normally trigger multiple alerts, and that alerts fire repeated notifications, so you can't treat alerts themselves as a kind of issue-lite. They have to be grouped together, with many attached to a single outage issue.
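To make the proposed workflow concrete, here is a minimal in-memory sketch of an issue that groups alerts and owns their silences. All of the names (`Alert`, `Silence`, `IncidentIssue`, `create_issue_from_alert`) are hypothetical and only illustrate the idea; nothing here reflects an existing GitLab or Alertmanager API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    labels: dict
    state: str = "firing"  # "firing" or "resolved"


@dataclass
class Silence:
    matchers: dict  # label name -> value this silence matches on
    active: bool = True


@dataclass
class IncidentIssue:
    title: str
    labels: list = field(default_factory=lambda: ["outage"])
    alerts: list = field(default_factory=list)
    silences: list = field(default_factory=list)
    is_open: bool = True

    def attach(self, alert, match_on=("alertname",)):
        """Attach an alert to this issue and silence it while the issue is open.

        `match_on` stands in for the UI that adjusts the matching labels.
        """
        self.alerts.append(alert)
        matchers = {k: alert.labels[k] for k in match_on if k in alert.labels}
        self.silences.append(Silence(matchers))

    def close(self):
        """Closing the issue removes every silence it created."""
        self.is_open = False
        for silence in self.silences:
            silence.active = False


def create_issue_from_alert(alert, title):
    """The "create issue" action: new outage issue with the alert attached."""
    issue = IncidentIssue(title)
    issue.attach(alert)
    return issue
```

The "add to existing issue" action would simply call `attach` on an already open issue, which is how many alerts end up grouped under one outage.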

From my experience in ops, I would recommend keeping the actual delivery of alert notifications outside this process. That is, GitLab would be a UI to view Alertmanager state, such as which alerts are or were firing, and to create and delete silences, but Alertmanager would send alert notifications directly to email, Slack, PagerDuty, etc. It may be worth having a UI to generate that configuration, but people will have lots of different quirky ideas about where to send notifications, including Slack, Hangouts, IRC, SMTP, SMS, voice calls, etc. There are a myriad of systems for managing this, and their simplicity and reliability depend on a simple configuration with minimal dependencies. The link back to GitLab can still go in the notification template, independent of (almost) any channel.
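As a rough illustration of the "create silences" half of that UI, the sketch below builds the JSON body that, as far as I understand the Alertmanager v2 HTTP API, would be POSTed to `/api/v2/silences`. The function name, the default duration, and the comment wording are assumptions, and the field names should be checked against the API spec for the Alertmanager version in use. It makes no network calls.

```python
from datetime import datetime, timedelta, timezone


def build_silence_payload(matchers, issue_url, created_by, hours=24, now=None):
    """Build a silence request body that points back at the incident issue.

    Alertmanager silences need an end time, so "until the issue is closed"
    is approximated here with a fixed duration (assumed default: 24 hours);
    the silence would be deleted early when the issue closes.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        # The comment records why the silence exists, as suggested above.
        "comment": f"Silenced by incident issue: {issue_url}",
    }
```

Putting the issue URL in the silence comment keeps the explanation for the silence attached to it, even for someone looking at Alertmanager directly.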


As an aside I think even if you're keeping the alert UI simple -- e.g. only handling comparing a single metric with a constant using a simple operator -- you should still mock it up assuming the metric has a few time series in it. That is, "HTTP Error Rate" may be a single metric but you may have half a dozen web servers reporting it so you'll have half a dozen time series and two of those may be above the threshold.
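A small sketch of why this matters, assuming the latest value of each time series is available as a mapping from server name to value (the `evaluate_alert` name and the data shape are made up for illustration): even a "single metric vs. constant" rule yields a per-series result, not one yes/no answer.

```python
import operator

# The simple comparison operators a basic alert UI might offer.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def evaluate_alert(series, op, threshold):
    """Return only the time series whose latest value breaches the threshold.

    `series` maps a series identifier (e.g. a web server name) to its
    latest value for the metric.
    """
    compare = OPERATORS[op]
    return {name: value for name, value in series.items() if compare(value, threshold)}


# "HTTP Error Rate" reported by half a dozen web servers.
http_error_rate = {
    "web-1": 0.01, "web-2": 0.07, "web-3": 0.02,
    "web-4": 0.09, "web-5": 0.00, "web-6": 0.03,
}
firing = evaluate_alert(http_error_rate, ">", 0.05)
```

Here `firing` contains two of the six series, so the mockup needs a way to show which servers are above the threshold rather than a single alert state.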
