Add the recording rules required for multiburn multiwindow error rates

Part of gitlab-com/gl-infra/scalability#52

This is best described in these two resources:

What does this change do?

Currently, we record per-minute error-rate and per-minute operation-rate for each service in the application.

Moving to multiwindow, multiburn error alerting provides many advantages, as described in https://landing.google.com/sre/workbook/chapters/alerting-on-slos/.

This change is part of the migration to multiwindow, multiburn error alerting.

For the initial rollout, we will use the values suggested by Google in the SRE Workbook:

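| Severity | Long window | Short window | Burn rate | Error budget consumed |
|----------|-------------|--------------|-----------|-----------------------|
| Page     | 1 hour      | 5 minutes    | 14.4      | 2%                    |
| Page     | 6 hours     | 30 minutes   | 6         | 5%                    |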
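
For reference, the "error budget consumed" column follows from the burn rate and the long window. The sketch below works through that arithmetic, assuming a 30-day SLO period (the period used in the Workbook's examples; it is an assumption here, not something this change configures). The function and constant names are illustrative only.

```python
# Assumption: the SLO is measured over a 30-day period (720 hours).
SLO_PERIOD_HOURS = 30 * 24


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the error budget used if errors burn at `burn_rate`
    for the whole of a window lasting `window_hours`."""
    return burn_rate * window_hours / period_hours


# (long window in hours, burn rate) for the two tiers in the table above.
TIERS = [
    (1, 14.4),
    (6, 6.0),
]

for window_hours, burn_rate in TIERS:
    pct = 100 * budget_consumed(burn_rate, window_hours)
    print(f"{window_hours}h at burn rate {burn_rate}: {pct:.1f}% of budget")
```

Under that assumption, the two tiers work out to the 2% and 5% shown in the table.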

Note that we are only going with two of the three tiers suggested. Once we are satisfied with these, we can roll out the third tier, which uses the longer windows (3 days and 6 hours).

For these window/burn-rate combinations, we need to record error rates over the following windows: 1h, 5m, 6h, 30m.

This change updates the recording rule renderer for the metrics catalog to include service rates over these windows.
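
As a rough illustration of what rendering rules over these windows could look like, here is a minimal sketch. It is not the renderer in this change: the counter names (`service_errors_total`, `service_operations_total`), the rule names, and the `service` label are hypothetical placeholders.

```python
# Hypothetical sketch: all metric, rule, and label names below are
# illustrative assumptions, not the names used by the metrics catalog.
WINDOWS = ["5m", "30m", "1h", "6h"]

# (rule kind, hypothetical source counter)
SOURCES = [
    ("errors", "service_errors_total"),
    ("ops", "service_operations_total"),
]


def recording_rules(windows=WINDOWS, sources=SOURCES):
    """Build one Prometheus-style recording rule per (window, kind) pair."""
    rules = []
    for window in windows:
        for kind, counter in sources:
            rules.append({
                "record": f"service_{kind}:rate_{window}",
                "expr": f"sum by (service) (rate({counter}[{window}]))",
            })
    return rules


for rule in recording_rules():
    print(rule["record"], "=", rule["expr"])
```

With four windows and two rates (errors and operations), this yields eight rules per evaluation group.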

Once this change is rolled in, we will add the alerts, but having the recording rules in place first will allow us to experiment with the values to ensure that the rates work for our workloads.
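
To make the eventual alert condition concrete, the sketch below shows the multiwindow check as described in the SRE Workbook: a tier fires only when the error ratio over both its long and its short window exceeds the burn rate multiplied by the error budget. The 99.9% SLO target and the sample ratios are made-up values for illustration, and `should_page` is a hypothetical helper, not part of this change.

```python
def should_page(long_ratio: float, short_ratio: float,
                burn_rate: float, slo_target: float) -> bool:
    """Return True when both windows exceed the burn-rate threshold.

    `long_ratio` and `short_ratio` are error ratios (errors / operations)
    measured over the long and short windows of one tier.
    """
    error_budget = 1.0 - slo_target       # e.g. 0.001 for a 99.9% SLO
    threshold = burn_rate * error_budget  # e.g. 14.4 * 0.001 = 0.0144
    return long_ratio > threshold and short_ratio > threshold


# Assumed values for illustration: a 99.9% SLO and the 1h/5m tier.
SLO_TARGET = 0.999
BURN_RATE = 14.4

# Sustained problem: both the 1h and 5m ratios are above the threshold.
print(should_page(0.02, 0.03, BURN_RATE, SLO_TARGET))

# Already recovering: the 1h ratio is still high but the 5m ratio has dropped.
print(should_page(0.02, 0.001, BURN_RATE, SLO_TARGET))
```

Requiring the short window as well is what lets the alert reset quickly once the error rate recovers, which is one of the advantages the Workbook chapter describes.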

Edited by Andrew Newdigate
