Allow aggregation sets to transactionally record ratios
In #2445 (closed) we saw that a metric missing from one recording can cause miscalculated ratios. The problem gets worse when we use these recording rules in sum_over_time calculations over a longer time range. This happened often in the past because recording rules failed to evaluate at one of the intermediate levels.
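As a rough illustration of the failure mode, here is a minimal sketch with made-up numbers (not taken from #2445), using plain Python lists as stand-ins for recorded samples. It compares a ratio built from independently recorded series with one where a gap always removes both sides:

```python
# Hypothetical samples over a longer window; None marks an evaluation
# where the recording rule failed and no sample was written.
success = [90.0, 90.0, None, 90.0]      # numerator: third recording is missing
total = [100.0, 100.0, 100.0, 100.0]    # denominator: fully recorded


def sum_over_time(samples):
    """Sum the samples that exist, skipping gaps the way sum_over_time does."""
    return sum(s for s in samples if s is not None)


# Independent recordings: the gap only hits the numerator.
skewed = sum_over_time(success) / sum_over_time(total)

# Transactional recordings: a gap removes both sides of the fraction.
paired = [(s, t) for s, t in zip(success, total) if s is not None]
consistent = sum(s for s, _ in paired) / sum(t for _, t in paired)

print(f"independent:   {skewed:.3f}")      # 270 / 400
print(f"transactional: {consistent:.3f}")  # 270 / 300
```

The longer the range, the more such gaps can accumulate on one side only, which is why the skew grows with the window.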
To work around this, we should allow aggregation-sets to generate transactionally consistent rules for the numerator and denominator of those fractions. To make sure that a numerator and denominator are always either both present, or both missing, we need to record them in a single recording rule. We could allow this as follows:
stageGroupSLIs: aggregationSet.AggregationSet({
id: 'stage_groups',
name: 'Stage Group Metrics',
intermediateSource: false,
selector: { monitor: 'global' },
labels: ['env', 'environment', 'stage', 'stage_group', 'product_stage'],
generateSLODashboards: false,
joinSource: {
metric: 'gitlab:feature_category:stage_group:mapping',
selector: { monitor: 'global' },
on: ['feature_category'],
labels: ['stage_group', 'product_stage'],
},
metricFormats: {
- apdexSuccessRate: 'gitlab:stage_group:execution:apdex:success:rate_%s',
- apdexWeight: 'gitlab:stage_group:execution:apdex:weight:score_%s',
+ apdexRates: 'gitlab:stage_group:execution:apdex:rates_%s',
apdexRatio: 'gitlab:stage_group:execution:apdex:ratio_%s',
- opsRate: 'gitlab:stage_group:execution:ops:rate_%s',
- errorRate: 'gitlab:stage_group:execution:error:rate_%s',
+ errorRates: 'gitlab:stage_group:execution:rates_%s',
errorRatio: 'gitlab:stage_group:execution:error:ratio_%s',
},
}),
This will then generate recording rules like this:
- record: 'gitlab:stage_group:execution:apdex:rates_5m'
expr: |
label_replace(
sum by (env,environment,stage,stage_group,product_stage) (
(gitlab:component:feature_category:execution:apdex:rates_5m{recorded_rate="apdex_weight"} >= 0)
* on(feature_category) group_left(product_stage,stage_group) (
group by (feature_category,product_stage,stage_group) (gitlab:feature_category:stage_group:mapping{monitor="global"})
)
),
'recorded_rate', 'apdex_weight', '', ''
)
or
label_replace(
sum by (env,environment,stage,stage_group,product_stage) (
(gitlab:component:feature_category:execution:apdex:rates_5m{recorded_rate="apdex_success"} >= 0)
* on(feature_category) group_left(product_stage,stage_group) (
group by (feature_category,product_stage,stage_group) (gitlab:feature_category:stage_group:mapping{monitor="global"})
)
),
'recorded_rate', 'apdex_success', '', ''
)
For source metrics a recording could look like this:
- record: 'gitlab:feature_category:execution:apdex:rates_5m'
expr: |
label_replace(
sum by (env,environment,tier,type,stage,component,feature_category)(
rate(gitlab_sli_rails_request_apdex_success_total{job="gitlab-rails",type="web"}[5m])
),
'recorded_rate', 'apdex_success', '', ''
)
or
label_replace(
sum by (env,environment,tier,type,stage,component,feature_category)(
rate(gitlab_sli_rails_request_apdex_total{job="gitlab-rails",type="web"}[5m])
),
'recorded_rate', 'apdex_total', '', ''
)
This will allow us to use these rules in aggregation chains and long-range aggregations, knowing that for each numerator datapoint there will be a denominator datapoint. If datapoints are missing, they will be missing on both sides of the fraction.
A ratio recording rule would then look like this:
- record: 'gitlab:stage_group:execution:apdex:ratio_5m'
expr: |
sum by (env,environment,stage,stage_group,product_stage) (
gitlab:stage_group:execution:apdex:rates_5m{recorded_rate="apdex_success"}
)
/
sum by (env,environment,stage,stage_group,product_stage) (
gitlab:stage_group:execution:apdex:rates_5m{recorded_rate="apdex_weight"}
)
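The pairing and the ratio can be modelled in a few lines. This is only a sketch under assumptions: vectors are plain dicts keyed by label tuples, `with_label`, `vector_or` and `select` are hypothetical helpers that loosely imitate label_replace, `or` and a `recorded_rate` selector, and the stage group name and rates are made up.

```python
def with_label(vector, name, value):
    """Attach a static label to every series (the role label_replace plays above)."""
    return {labels + ((name, value),): sample for labels, sample in vector.items()}


def vector_or(left, right):
    """Union in the spirit of PromQL `or`: keep left, add right series with unseen label sets."""
    return {**right, **left}


def select(vector, name, value):
    """Keep series carrying name=value and drop that label so both sides can be matched."""
    return {
        tuple(pair for pair in labels if pair[0] != name): sample
        for labels, sample in vector.items()
        if (name, value) in labels
    }


# Made-up rates for a single hypothetical stage group.
key = (("stage_group", "example_group"),)
success_rate = {key: 90.0}
weight_rate = {key: 100.0}

# One rule evaluation writes both series, or fails and writes neither.
recorded = vector_or(
    with_label(weight_rate, "recorded_rate", "apdex_weight"),
    with_label(success_rate, "recorded_rate", "apdex_success"),
)

numerator = select(recorded, "recorded_rate", "apdex_success")
denominator = select(recorded, "recorded_rate", "apdex_weight")
ratio = {
    labels: numerator[labels] / denominator[labels]
    for labels in numerator
    if labels in denominator
}
print(ratio)
```

Because both `recorded_rate` values come out of the same evaluation, the `if labels in denominator` guard should never drop a series in this model; with separately recorded rules it could.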
Caveat: Many metrics are not initialized on startup. This is generally manageable for apdex metrics, which use a success rate: the series becomes available as soon as one operation meets its target. But for error rates we need to build in a safeguard as follows:
- record: 'gitlab:feature_category:execution:rates_5m'
expr: |
label_replace(
sum by (env,environment,tier,type,stage,component,feature_category)(
rate(gitlab_sli_rails_request_error_total{job="gitlab-rails",type="web"}[5m])
) or 0 * sum by (env,environment,tier,type,stage,component,feature_category)(
rate(gitlab_sli_rails_request_total{job="gitlab-rails",type="web"}[5m])
),
'recorded_rate', 'error_rate', '', ''
)
or
label_replace(
sum by (env,environment,tier,type,stage,component,feature_category)(
rate(gitlab_sli_rails_request_total{job="gitlab-rails",type="web"}[5m])
),
'recorded_rate', 'ops_rate', '', ''
)
This ensures we record a 0 for the error rate when the error metric is missing. If the metric is missing because of a failure to scrape or record, this will have a positive effect on the reported availability. We do this for error rates because the error rate could be low enough to otherwise never produce a series. For apdex we don't need this: we hopefully have mostly operations that meet the apdex target, so the series becomes available quickly.
For metrics named gitlab_sli* (application SLIs), we generally initialize the metrics correctly.
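The zero fallback in the error-rate rule above can be modelled as follows. This is a sketch with made-up numbers: vectors are plain dicts keyed by hypothetical (type, feature_category) tuples, and `vector_or` is a hypothetical helper that loosely imitates PromQL `or`.

```python
def vector_or(left, right):
    """Union in the spirit of PromQL `or`: keep left, add right series with unseen label sets."""
    return {**right, **left}


# Only one hypothetical feature category has produced an error series so far.
errors = {("web", "category_a"): 0.2}
ops = {("web", "category_a"): 50.0, ("web", "category_b"): 10.0}

# `0 * ops` keeps every label set of the ops rate but zeroes the value.
zero_fallback = {labels: 0 * rate for labels, rate in ops.items()}

# Real error rates win; label sets without an error series fall back to 0.
error_rate = vector_or(errors, zero_fallback)
print(error_rate)
```

Every label set that has an ops rate now also has an error rate, so the error ratio is defined for categories that have not yet produced an error.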
When this is implemented and transactional recordings are occurring, we need to make sure we also use these transactional metrics in the query for stage group error budgets: https://gitlab.com/gitlab-com/runbooks/blob/31a185bbb798c0568f39b94c92519de2c19aba70/libsonnet/stage-groups/error-budget/queries.libsonnet#L15
When we implement this, we need to take #2482 (closed) into account: if that is already implemented, we should apply that offset when generating rules.