Extracted from !3068 (merged), in order to reduce the scale of that change.

Background

When defining SLIs in the metrics-catalog, there is an option to exclude certain operation rates from being aggregated to the service level. This is done with the aggregateRequestRate attribute, for example:


    secondary_servers: {
      userImpacting: true,  // userImpacting for data redundancy reasons
      featureCategory: 'not_owned',
      team: 'sre_observability',
      
      ...
      aggregateRequestRate: false,
    },

This is used in a few places on less important SLIs, such as Redis replica operations per second, where including the operation rate is noisy and confusing and doesn't add value.

Unfortunately, there is a bug in the current implementation.

Although we exclude operation-rates, we continue to include error rates in the aggregated service metrics.

This is problematic, because:

error ratio = error rate / operation rate

For aggregateRequestRate: false SLIs, we continue to include errors in the service aggregations, while not including the operations. While this has never been a problem in practice, it could lead to false alarms when a non-aggregated SLI generates a lot of errors, or even to situations where the error ratio recorded at the service level exceeds 100%.

These errors could cause the service to violate its error rate SLO, leading to inaccurate alerts.
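
To make the failure mode concrete, here is a small Python model of the current aggregation behaviour. It is only an illustration: the primary_servers SLI and all of the rates are made-up values, and the real aggregation happens in the recording rules rather than in Python.

```python
# Illustrative model only. secondary_servers comes from the example above;
# primary_servers and every rate below are hypothetical numbers.
slis = {
    # name: (operations per second, errors per second, aggregateRequestRate)
    "primary_servers": (100.0, 1.0, True),
    "secondary_servers": (5000.0, 150.0, False),
}

# Current behaviour: operations are filtered on aggregateRequestRate...
service_ops = sum(ops for ops, _, aggregate in slis.values() if aggregate)

# ...but errors from every SLI are still summed into the service total.
service_errors = sum(errors for _, errors, _ in slis.values())

service_error_ratio = service_errors / service_ops
print(f"service error ratio: {service_error_ratio:.0%}")  # prints 151%
```

In this sketch secondary_servers has an error ratio of only 3% on its own (150 / 5000), yet its errors push the service-level ratio above 100% because its operations are missing from the denominator.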

Note that we continue to alert on the non-aggregated SLI itself. This change just excludes its errors from being aggregated to the service level, as we already do with its operations.

Overall, this leads to a configuration that is less surprising and, in the long term, easier to maintain.

Method

Implementing this is fairly straightforward:

Rename the aggregate_rps label on the SLI mapping to aggregate_to_service, since that better describes the attribute ("is this SLI included in service-level aggregations?").
On error-rate recording rules, filter the aggregation so that it includes only SLIs with aggregate_to_service="yes".
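
The intended effect of the two steps can be sketched in Python as follows. This is a model of the filtering logic under assumed data shapes, not the recording rules themselves: the dictionary layout, the service_error_ratio helper, the primary_servers SLI and the rates are all hypothetical. Only the aggregate_to_service label and its "yes" value come from the description above.

```python
def service_error_ratio(slis):
    """Aggregate ops and errors over SLIs labelled aggregate_to_service="yes".

    Hypothetical helper: `slis` is a list of dicts with the keys
    "aggregate_to_service", "ops" and "errors".
    Returns None when no operations are aggregated.
    """
    included = [s for s in slis if s["aggregate_to_service"] == "yes"]
    ops = sum(s["ops"] for s in included)
    # The same filter now applies to errors, so the numerator and the
    # denominator always cover the same set of SLIs.
    errors = sum(s["errors"] for s in included)
    return errors / ops if ops > 0 else None


# Made-up rates, for illustration only.
slis = [
    {"name": "primary_servers", "aggregate_to_service": "yes",
     "ops": 100.0, "errors": 1.0},
    {"name": "secondary_servers", "aggregate_to_service": "no",
     "ops": 5000.0, "errors": 150.0},
]

ratio = service_error_ratio(slis)
print(f"service error ratio: {ratio:.0%}")  # prints 1%
```

Because the excluded SLI contributes to neither sum, the service-level ratio can no longer exceed 100% as a result of this mismatch.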

Edited Jan 19, 2021 by Andrew Newdigate

Don't aggregate error rates to service when not aggregating operation rates to service
