Potentially Incorrect Rate Calculation Using aws_es_2xx_average

NOTE: This issue is actually for project https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-observability-config, but the issues are disabled there.

I know this is a bit ugly, but we did want to contribute our findings back to you, GitLab!

Below is an AI-generated explanation based on an internal doc we wrote, in which we use this query as an example.


Summary

The following PromQL query used in the metrics catalog may produce incorrect request rates depending on the CloudWatch SampleCount behavior of the underlying metric:

avg_over_time(aws_es_2xx_average{type="logging"}[5m]) / 60

Source: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-observability-config/-/blob/main/metrics-catalog/aws/logging.libsonnet#L30

This query assumes that the CloudWatch Average statistic represents “requests per minute”, which is not generally true for counter-like CloudWatch metrics.


Why This Is Potentially Incorrect

In CloudWatch, the Average statistic is defined as:

Average = Sum / SampleCount

This means:

  • Average is not normalized by time
  • It is normalized by number of samples
  • SampleCount is service-specific and not guaranteed to be constant

Dividing Average by 60 therefore produces a correct per-second rate only if SampleCount == 1 (and the CloudWatch period is 60 seconds, as the query assumes).
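
To make the arithmetic concrete, here is a minimal Python sketch. The function names and the datapoint (1200 requests in one 60-second period) are made up for illustration; only the relationship Average = Sum / SampleCount comes from the CloudWatch definition above.

```python
def cloudwatch_average(total: float, sample_count: int) -> float:
    """CloudWatch definition: Average = Sum / SampleCount."""
    return total / sample_count


def rate_from_average(average: float, period_seconds: float = 60.0) -> float:
    """The pattern used by the query: treat Average as a per-period count."""
    return average / period_seconds


def rate_from_sum(total: float, period_seconds: float = 60.0) -> float:
    """Per-second rate derived from the Sum statistic."""
    return total / period_seconds


# Hypothetical datapoint: 1200 requests in one 60-second period.
total = 1200.0
for sample_count in (1, 4):
    average = cloudwatch_average(total, sample_count)
    print(sample_count, rate_from_average(average), rate_from_sum(total))
```

With SampleCount == 1 both calculations agree; with SampleCount == 4 the Average-based rate is four times too low, while the Sum-based rate is unchanged.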


Why This Currently Appears to Work

For AWS OpenSearch Service, we observed that:

  • SampleCount for the 2xx request metric is currently always 1

  • As a result:

    Average == Sum

  • In this specific case, dividing by 60 yields a correct rate by coincidence

However, this behavior is not guaranteed by AWS and differs across services.


Why This Breaks for Other Metrics

When applying the same pattern to ALB request count metrics, for example:

  • SampleCount typically ranges between 70 and 80
  • SampleCount varies over time

In that case:

Average = Sum / SampleCount

Dividing Average by 60 therefore produces a rate that is too low by a factor of SampleCount, which in this case means roughly 70 to 80 times lower than reality.
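
As a rough illustration of the size of the error, the sketch below compares the reported and true rates for a few SampleCount values in the range we observed for ALB. The request total (6000 per period) and the helper name are hypothetical; exact fractions are used only to avoid floating-point noise in the ratio.

```python
from fractions import Fraction


def reported_and_true_rate(total: int, sample_count: int, period_seconds: int = 60):
    """Return (rate from Average / period, rate from Sum / period) as exact fractions."""
    average = Fraction(total, sample_count)  # Average = Sum / SampleCount
    reported = average / period_seconds      # what the Average-based query computes
    true_rate = Fraction(total, period_seconds)
    return reported, true_rate


# Hypothetical period with 6000 requests; SampleCount values in the observed ALB range.
for sample_count in (70, 75, 80):
    reported, true_rate = reported_and_true_rate(6000, sample_count)
    # The ratio true_rate / reported is exactly SampleCount.
    print(sample_count, float(reported), float(true_rate), true_rate / reported)
```

Because SampleCount also varies over time, the error is not a constant factor that could be corrected with a fixed multiplier.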

This makes the query unsafe as a general pattern for counter-like CloudWatch metrics.


Recommended Approach

For counter-like CloudWatch metrics (requests, invocations, bytes), the more robust approach is to use the Sum statistic and divide by the CloudWatch period length in seconds (60 in this case):

avg_over_time(aws_es_2xx_sum{type="logging"}[5m]) / 60

This produces a correct per-second rate regardless of SampleCount.
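
The following Python sketch shows why the Sum-based calculation does not depend on SampleCount. It models one CloudWatch period as a list of raw samples; the sample values, the helper names, and the 60-second period are assumptions for illustration, not data from a real service.

```python
def summarize(samples):
    """Return (Sum, SampleCount, Average) for the samples of one CloudWatch period."""
    total = sum(samples)
    count = len(samples)
    return total, count, total / count


def per_second_rate(value, period_seconds):
    """Convert a per-period value into a per-second rate."""
    return value / period_seconds


PERIOD_SECONDS = 60  # assumed CloudWatch period length

# The same 600 requests, reported as one sample or as sixty samples of 10 each.
for samples in ([600], [10] * 60):
    total, count, average = summarize(samples)
    print(
        count,
        per_second_rate(total, PERIOD_SECONDS),    # Sum-based rate
        per_second_rate(average, PERIOD_SECONDS),  # Average-based rate
    )
```

The Sum-based rate is 10 requests per second in both cases, whereas the Average-based rate changes with how the service happens to batch its samples. If the period is not 60 seconds, the divisor has to change accordingly (for example, 300 for a 5-minute period).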


Why This Matters

  • The current query works only under a specific SampleCount behavior
  • Users copying this example for other AWS services will get incorrect results
  • The failure mode is silent and hard to detect


Suggestion

Clarify or update the example to either:

  • Explicitly rely on SampleCount == 1 (and document this assumption), or
  • Prefer the *_sum metric for rate calculations to avoid ambiguity