Potentially Incorrect Rate Calculation Using aws_es_2xx_average
NOTE: This issue is actually for project https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/tenant-observability-config, but the issues are disabled there.
I know this is a bit ugly, but we did want to contribute our findings back to you, GitLab!
Below is an AI-generated explanation based on an internal doc we wrote, which uses this query as an example.
Summary
The following PromQL query used in the metrics catalog may produce incorrect request rates depending on the CloudWatch SampleCount behavior of the underlying metric:
avg_over_time(aws_es_2xx_average{type="logging"}[5m]) / 60
This query assumes that the CloudWatch Average statistic represents “requests per minute”, which is not generally true for counter-like CloudWatch metrics.
Why This Is Potentially Incorrect
In CloudWatch, the Average statistic is defined as:
Average = Sum / SampleCount
This means:
- Average is not normalized by time
- It is normalized by number of samples
- SampleCount is service-specific and not guaranteed to be constant
Dividing Average by 60 therefore only produces a correct per-second rate if SampleCount == 1.
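To make the dependence on SampleCount concrete, here is a minimal Python sketch of the arithmetic. The request count and the function names are hypothetical, and the 60-second period is an assumption taken from the "/ 60" in the query.

```python
PERIOD_SECONDS = 60  # assumed CloudWatch period, matching the "/ 60" in the query


def cloudwatch_average(total: float, sample_count: int) -> float:
    """CloudWatch Average statistic: Sum / SampleCount."""
    return total / sample_count


def rate_from_average(total: float, sample_count: int) -> float:
    """Per-second rate as the Average-based query computes it."""
    return cloudwatch_average(total, sample_count) / PERIOD_SECONDS


def true_rate(total: float) -> float:
    """Actual per-second rate: Sum / period length."""
    return total / PERIOD_SECONDS


# Hypothetical datapoint: 1200 requests in one 60-second period.
requests_in_period = 1200

# With a single sample, Average == Sum, so both rates agree.
matches_when_single_sample = (
    rate_from_average(requests_in_period, 1) == true_rate(requests_in_period)
)
# With four samples, Average is a quarter of Sum, so the rates diverge.
matches_when_four_samples = (
    rate_from_average(requests_in_period, 4) == true_rate(requests_in_period)
)
```

The two results only agree in the single-sample case, which is the situation described for OpenSearch.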
Why This Currently Appears to Work
For AWS OpenSearch Service, we observed that:
- SampleCount for the 2xx request metric is currently always 1
- As a result: Average == Sum
- In this specific case, dividing by 60 yields a correct rate by coincidence
However, this behavior is not guaranteed by AWS and differs across services.
Why This Breaks for Other Metrics
When applying the same pattern to ALB request count metrics, for example:
- SampleCount typically ranges between 70–80
- It varies over time
In that case:
Average = Sum / SampleCount
Dividing Average by 60 then understates the true rate by a factor of SampleCount, so the result is roughly 70 to 80 times too low.
This makes the query unsafe as a general pattern for counter-like CloudWatch metrics.
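As a rough illustration of the size of the error, the following Python sketch uses made-up numbers: a hypothetical ALB datapoint with 6000 requests in one period and a SampleCount of 75 (inside the 70–80 range we observed). The 60-second period is again an assumption.

```python
PERIOD_SECONDS = 60  # assumed CloudWatch period

# Hypothetical ALB datapoint: 6000 requests in one period, reported as 75 samples.
request_sum = 6000
sample_count = 75

average = request_sum / sample_count         # CloudWatch Average = Sum / SampleCount
rate_via_average = average / PERIOD_SECONDS  # what the Average-based query reports
rate_via_sum = request_sum / PERIOD_SECONDS  # actual requests per second

# The Average-based rate is too low by a factor equal to SampleCount.
underestimate_factor = rate_via_sum / rate_via_average

print(f"Average-based: {rate_via_average:.2f} req/s, Sum-based: {rate_via_sum:.2f} req/s")
```

With these numbers the Average-based query reports about 1.33 req/s while the actual rate is 100 req/s.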
Recommended Safer Pattern
For counter-like CloudWatch metrics (requests, invocations, bytes), the more robust approach is to use the Sum statistic and divide by the CloudWatch period length in seconds (60 here):
avg_over_time(aws_es_2xx_sum{type="logging"}[5m]) / 60
This produces a correct per-second rate regardless of SampleCount.
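The Sum-based query can be mimicked in a few lines of Python to check the result by hand. This is only a sketch: the datapoints are hypothetical, per_second_rate is an illustrative helper rather than part of any exporter, and it assumes one Sum datapoint per 60-second period.

```python
from statistics import mean

PERIOD_SECONDS = 60  # assumed CloudWatch period


def per_second_rate(sum_datapoints: list[float]) -> float:
    """Mimic avg_over_time(<metric>_sum[5m]) / 60 on per-period Sum values."""
    return mean(sum_datapoints) / PERIOD_SECONDS


# Hypothetical Sum datapoints, one per minute over a 5-minute window.
window = [1200, 1500, 900, 1200, 1200]

# 6000 requests over 300 seconds.
rate = per_second_rate(window)
```

Averaging the per-period sums and dividing by the period length gives the same value as dividing the window's total requests by the window's length in seconds, and SampleCount never enters the calculation.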
Why This Matters
- The current query works only under a specific SampleCount behavior
- Users copying this example for other AWS services will get incorrect results
- The failure mode is silent and hard to detect
Suggestion
Clarify or update the example to either:
- Explicitly rely on SampleCount == 1 (and document this assumption), or
- Prefer the *_sum metric for rate calculations to avoid ambiguity