Don't aggregate error rates to service when not aggregating operation rates to service
Extracted from !3068 (merged), in order to reduce the scale of that change.
Required for !3119 (merged)
Background
When defining SLIs in the metrics-catalog, there is an option to exclude certain operation rates from being aggregated to the service level. This is done with the aggregateRequestRate attribute, for example:
secondary_servers: {
  userImpacting: true, // userImpacting for data redundancy reasons
  featureCategory: 'not_owned',
  team: 'sre_observability',
  ...
  aggregateRequestRate: false,
},
This is used in a few places on less important SLIs, such as Redis replica operations per second, where including the operation rate is noisy, confusing, and doesn't add value.
Unfortunately, there is a bug in the current implementation.
Although we exclude operation-rates, we continue to include error rates in the aggregated service metrics.
This is problematic, because:
error ratio = error rate / operation rate
For SLIs with aggregateRequestRate: false, we continue to include errors in the service aggregations while excluding the operations. While this has never been a problem, it could lead to false alarms when a non-aggregated SLI generates a lot of errors, or even to situations where the error ratio recorded at the service level exceeds 100%.
These errors could cause the service to violate its error rate SLO, leading to inaccurate alerts.
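To make the failure mode concrete, here is a small arithmetic sketch of the pre-fix aggregation. The SLI names and rates are invented for illustration, and the function only models the behavior described above; it is not the actual recording-rule implementation.

```python
# Hypothetical per-SLI rates (events per second) for a single service.
# Each value is (operation_rate, error_rate, aggregate_request_rate).
slis = {
    "primary_server": (100.0, 1.0, True),
    "secondary_servers": (5000.0, 150.0, False),  # excluded from service ops
}


def buggy_service_error_ratio(slis):
    """Model of the bug: operations are filtered, errors are not."""
    ops = sum(o for o, _, aggregate in slis.values() if aggregate)
    errors = sum(e for _, e, _ in slis.values())
    return errors / ops


ratio = buggy_service_error_ratio(slis)
print(f"{ratio:.0%}")  # prints "151%"
```

In this made-up case secondary_servers has an error ratio of only 3% on its own (150 / 5000), yet its errors are divided by the 100 operations per second of the one aggregated SLI, so the service-level ratio comes out above 100%.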
Note that we continue to alert on the non-aggregated SLI itself. This change only excludes its errors from being aggregated to the service level, as we already do with operations.
Overall, this leads to a configuration that is less surprising and, in the long term, easier to maintain.
Method
Implementing this is fairly straightforward:
- Rename the aggregate_rps label on SLI mapping to aggregate_to_service, since that better describes the attribute ("is this SLI included in service-level aggregations?")
- On error-rate recording rules, filter the aggregation on aggregate_to_service="yes" SLIs only
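The two steps above can be sketched in plain Python as a model of the intended behavior. This is an illustration under assumptions, not the recording-rule code: the helper names and the series layout are hypothetical, and the "no" label value is assumed (only "yes" appears in this description).

```python
def aggregate_to_service_label(aggregate_request_rate: bool) -> str:
    """Map the catalog attribute to the renamed label.

    "yes" comes from the description above; "no" is an assumed counterpart.
    """
    return "yes" if aggregate_request_rate else "no"


def service_error_ratio(series):
    """Apply the same label filter to both the errors and the operations."""
    included = [s for s in series if s["aggregate_to_service"] == "yes"]
    ops = sum(s["ops_rate"] for s in included)
    errors = sum(s["error_rate"] for s in included)
    # No aggregated traffic means there is no meaningful ratio to report.
    return errors / ops if ops else None


# Hypothetical series for one service, labelled from the catalog attribute.
series = [
    {"sli": "primary_server", "ops_rate": 100.0, "error_rate": 1.0,
     "aggregate_to_service": aggregate_to_service_label(True)},
    {"sli": "secondary_servers", "ops_rate": 5000.0, "error_rate": 150.0,
     "aggregate_to_service": aggregate_to_service_label(False)},
]

print(f"{service_error_ratio(series):.0%}")  # prints "1%"
```

Because the numerator and denominator are drawn from the same set of SLIs, the service-level ratio stays between 0 and 100%, while the excluded SLI can still be evaluated and alerted on by itself.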