Don't aggregate error rates to service when not aggregating operation rates to service

Extracted from !3068 (merged), in order to reduce the scale of that change.

Required for !3119 (merged)

Background

When defining SLIs in the metrics-catalog, there is an option to exclude certain operation rates from being aggregated to the service level. This is done with the aggregateRequestRate attribute, for example:


    secondary_servers: {
      userImpacting: true,  // userImpacting for data redundancy reasons
      featureCategory: 'not_owned',
      team: 'sre_observability',
      
      ...
      aggregateRequestRate: false,
    },

This is used in a few places on less important SLIs, such as Redis replica ops/second, where including the operation rate is noisy and confusing and doesn't add value.

Unfortunately, there is a bug in the current implementation.

Although we exclude operation-rates, we continue to include error rates in the aggregated service metrics.

This is problematic, because:

error ratio = error rate / operation rate

For SLIs with aggregateRequestRate: false, we continue to include errors in the service-level aggregations while not including the operations. While this has never been a problem, it could lead to false alarms when a non-aggregated SLI generates a lot of errors, or even to situations where the error ratio recorded at the service level exceeds 100%.

These errors could cause the service to violate its error rate SLO, leading to inaccurate alerts.
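To make the arithmetic concrete, here is a minimal Python sketch of the ratio above. It is a toy model, not the catalog's actual implementation: the SLI names, the rates, and the `service_error_ratio` helper are all made up for illustration.

```python
# Toy per-SLI rates for one service. All names and numbers are hypothetical.
slis = [
    {"name": "primary_server", "ops": 100.0, "errors": 1.0, "aggregate_request_rate": True},
    {"name": "secondary_servers", "ops": 500.0, "errors": 150.0, "aggregate_request_rate": False},
]


def service_error_ratio(slis, filter_errors):
    """Return service-level error rate divided by service-level operation rate.

    Operations are always restricted to SLIs that aggregate to the service.
    Errors are restricted the same way only when filter_errors is True.
    """
    ops = sum(s["ops"] for s in slis if s["aggregate_request_rate"])
    errors = sum(
        s["errors"]
        for s in slis
        if s["aggregate_request_rate"] or not filter_errors
    )
    return errors / ops


# Current behaviour: errors from every SLI, operations only from aggregated SLIs.
buggy = service_error_ratio(slis, filter_errors=False)  # (1 + 150) / 100

# Proposed behaviour: errors and operations come from the same set of SLIs.
fixed = service_error_ratio(slis, filter_errors=True)  # 1 / 100

print(f"buggy={buggy:.2f} fixed={fixed:.2f}")
```

With these invented numbers, the unfiltered ratio comes out above 1 (that is, above 100%), while the filtered ratio reflects only the SLI that actually contributes operations.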

Note that we continue to alert on the non-aggregated SLI itself; this change just excludes its errors from being aggregated to the service level, as we already do with operations.

Overall, this leads to a configuration that is less surprising and, in the long term, easier to maintain.

Method

Implementing this is fairly straightforward:

  1. Rename the aggregate_rps label on SLI mapping to aggregate_to_service, since that better describes the attribute ("is this SLI included in service-level aggregations?")
  2. On error-rate recording rules, filter the aggregation on aggregate_to_service="yes" SLIs only
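
The second step can be sketched as follows. This is only an illustration of the filtering idea: the `service_aggregation_expr` helper, the `component_errors:rate` metric name, and the `env` and `type` grouping labels are placeholders, not the names used by the real recording rules.

```python
def service_aggregation_expr(source_metric, by_labels, only_aggregated):
    """Build a PromQL sum over component-level series.

    When only_aggregated is True, the selector keeps only series whose
    aggregate_to_service label is "yes".
    """
    selector = '{aggregate_to_service="yes"}' if only_aggregated else ""
    return "sum by ({}) ({}{})".format(", ".join(by_labels), source_metric, selector)


# Hypothetical metric and label names, chosen only for this example.
expr = service_aggregation_expr(
    "component_errors:rate", ["env", "type"], only_aggregated=True
)
print(expr)
```

Applying the same selector to both the error-rate and operation-rate aggregations keeps the numerator and denominator of the service-level ratio drawn from the same set of SLIs.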
Edited by Andrew Newdigate
