Provide option to snooze Operations Dashboard Alerts
Problem to solve
Some alerts represent something that needs to be done soon, but not immediately. In these cases it would help responders to be able to 'snooze' an alert for a set amount of time, confident that they will be notified again at a specified later date and time.
For instance, a file server reaches 80% capacity, which triggers an alert, but there are still a few weeks of leeway before that file server's performance degrades.
Intended users
Further details
A relevant observation from the SRE Shadow was that SREs have many more alerts triggering than I expected, so the ability to snooze alerts could help reduce noise.
Q: Do similar tools have this feature?
A: Yes - Alertmanager and PagerDuty do:
- https://prometheus.io/docs/alerting/alertmanager/#silences
- https://support.pagerduty.com/docs/editing-incidents#section-snooze-an-incident
It is planned in Grafana but not implemented yet: https://github.com/grafana/grafana/issues/5856
Proposal
Alerts trigger emails and other events via webhooks.
It would be good to be able to:
- view active alerts and 'snooze' them so they do not reappear for a set amount of time (e.g. 1 week).
- 'snooze' an alert directly from the email / event that the alert generates.
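To make the intended behaviour concrete, here is a minimal sketch of the snooze logic in Python. It is only an illustration of the idea, not a proposed implementation: the `Alert` class, its `snoozed_until` field, and the `snooze` / `should_notify` methods are all hypothetical names and do not correspond to any existing code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    title: str
    snoozed_until: Optional[datetime] = None

    def snooze(self, duration: timedelta, now: datetime) -> None:
        # Suppress notifications for this alert until `now + duration`.
        self.snoozed_until = now + duration

    def should_notify(self, now: datetime) -> bool:
        # Notify if the alert was never snoozed or the snooze has expired.
        return self.snoozed_until is None or now >= self.snoozed_until


# Example: the file server alert from above, snoozed for one week.
now = datetime(2020, 1, 1, 9, 0)
alert = Alert("File server 80% full")
alert.snooze(timedelta(weeks=1), now)
still_quiet = alert.should_notify(now + timedelta(days=3))   # False
fires_again = alert.should_notify(now + timedelta(weeks=1))  # True
```

The point of the sketch is that a snoozed alert is not dismissed: once the snooze period ends, it becomes eligible for notification again, which is what gives responders the confidence to defer it.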