Support setting alerts for percentage and percentile thresholds

Problem to solve

Alerts can currently only be set up for absolute values for a given metric.

It would be more useful to be able to create alerts on values that indicate something is 'out of the norm', such as a percentage or percentile value.

For example, imagine a metric that shows the number of 5xx errors occurring per second. We could classify the following alerts:

  • Absolute: Alerts based on a specific threshold value, such as "10 5xx errors occurred per second".
  • Percentage: Alerts where the threshold is a percentage value, such as "5xx error rate at 10% of total requests at this time".
  • Percentile: Alerts based on where the current value falls within its historical distribution, such as "5xx error rate is in the 90th percentile".

Currently 'Absolute' is the only alert type available. This type of alert is brittle, because expected changes can make the threshold invalid. For example, a gradual increase in traffic to a website might cause more 5xx errors to occur per second, even though the percentage of errors has gone down.
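To make the difference between the three alert types concrete, here is a minimal sketch in Python. The function names, thresholds, and traffic numbers are hypothetical and chosen only for illustration; the percentile check uses a simple nearest-rank calculation, which is one possible definition and not necessarily how the product would implement it.

```python
import math
from typing import Sequence


def absolute_alert(errors_per_sec: float, threshold: float) -> bool:
    """Fire when the raw error count per second reaches the threshold."""
    return errors_per_sec >= threshold


def percentage_alert(errors_per_sec: float, requests_per_sec: float,
                     threshold_pct: float) -> bool:
    """Fire when errors make up at least threshold_pct percent of requests."""
    if requests_per_sec <= 0:
        return False  # no traffic, so no meaningful error rate
    return 100.0 * errors_per_sec / requests_per_sec >= threshold_pct


def percentile_alert(current: float, history: Sequence[float],
                     percentile: float) -> bool:
    """Fire when the current value is at or above the given percentile
    of historical values (nearest-rank method, an assumption)."""
    if not history:
        return False
    ranked = sorted(history)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    return current >= ranked[rank - 1]


# Hypothetical samples: (5xx errors per second, total requests per second).
low_traffic = (8.0, 50.0)      # 16% of requests fail
high_traffic = (12.0, 1000.0)  # 1.2% of requests fail

for errors, requests in (low_traffic, high_traffic):
    print(errors, requests,
          absolute_alert(errors, threshold=10),
          percentage_alert(errors, requests, threshold_pct=10))
```

With these made-up numbers, the absolute alert stays silent in the low-traffic case where 16% of requests fail, and fires in the high-traffic case where only 1.2% fail. The percentage alert behaves the other way around, which is the behaviour the proposal is after.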

Percentage and Percentile are better indicators that something is 'out of the norm'. Such metrics would be useful as an initial warning that an incident may be occurring and that something needs to be investigated.

Intended users

  • Devon (DevOps Engineer)
  • Sidney (Systems Administrator)

Further details

Existing screen:

Screenshot_2020-01-22_at_21.09.38

The workaround for getting percentage or percentile alerts is to configure a custom metric that tracks such a value, for example "percentage of 5xx errors from total requests". Once the metric is set up, the existing absolute alert configuration can be used against it.

Queries for such metrics may be more complicated to write.

This feature proposal therefore exists to reduce the need for setting up multiple metrics to track different dimensions of the same thing.

Proposal

Add additional options to the existing alerts configuration screens.

Grafana handles this with a 'Conditions' query that allows selection of various simple calculations based on the value of a given metric over time:

Screenshot_2020-01-22_at_21.07.42
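As a rough illustration of the idea, a condition of this kind can be thought of as "reduce a window of samples to one value, then compare it to a threshold". The sketch below is not Grafana's implementation; the reducer names and the `condition_met` function are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Dict, Sequence

# Hypothetical set of simple calculations over a window of metric samples.
REDUCERS: Dict[str, Callable[[Sequence[float]], float]] = {
    "avg": mean,
    "min": min,
    "max": max,
    "median": median,
    "last": lambda samples: samples[-1],
}


def condition_met(window: Sequence[float], reducer: str,
                  threshold: float, above: bool = True) -> bool:
    """Reduce the window to a single value and compare it to the threshold."""
    if not window:
        return False  # no data in the window, nothing to evaluate
    value = REDUCERS[reducer](window)
    return value > threshold if above else value < threshold


# Example: 5xx errors per second sampled over a recent time window.
window = [2.0, 4.0, 12.0, 6.0]
print(condition_met(window, "avg", 5))    # average of the window vs. 5
print(condition_met(window, "max", 10))   # peak of the window vs. 10
print(condition_met(window, "last", 10))  # most recent sample vs. 10
```

Each evaluation walks the whole window, so the cost grows with the window length and the number of alerts.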

Performance would need to be considered carefully. Running additional calculations over metric data may prove to be expensive for large deployments.

Permissions and Security

This would be restricted to those who can already set alerts.

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Links / references

Edited Jul 05, 2020 by Dov Hershkovitch