Calculating error budget for a stage group for a single component

To get going on &437 we should make sure we can calculate the error budget for a single component (GitLab-rails) for a single stage group (doesn't really matter which one).

Required source metrics

  • error rate over 1h per feature category: gitlab:component:feature_category:execution:error:rate_1h{component="puma"}
  • total measurement rate for error rate over 1h per feature category: gitlab:component:feature_category:execution:ops:rate_1h{component="puma"}
  • apdex success rate over 1h per feature category: gitlab:component:feature_category:execution:apdex:success:rate_1h{component="puma"} (gitlab-com/runbooks!3348 (merged))
  • total apdex measurement rate over 1h per feature category: gitlab:component:feature_category:execution:apdex:weight:score_1h{component="puma"}
  • gitlab:feature_category:stage_group:mapping to map these metrics to stage groups

Proposal

New recordings based on the above source metrics (nothing in place yet)

Each of the above gitlab:component:feature_category:* recordings gets an equivalent gitlab:component:stage_group:* recording that sums over all the feature categories of that group while keeping the other labels.

Feature Category Recording | Stage Group Recording
gitlab:component:feature_category:execution:error:rate_%s | gitlab:component:stage_group:execution:error:rate_%s
gitlab:component:feature_category:execution:ops:rate_%s | gitlab:component:stage_group:execution:ops:rate_%s
gitlab:component:feature_category:execution:apdex:success:rate_%s | gitlab:component:stage_group:execution:apdex:success:rate_%s
gitlab:component:feature_category:execution:apdex:weight:score_%s | gitlab:component:stage_group:execution:apdex:weight:score_%s

These recordings could be part of the feature category aggregation set in the runbooks.
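As a rough sketch of what one such stage-group recording could look like, the expression below joins the 1h error rate with the mapping metric and sums away the feature category. It assumes that gitlab:feature_category:stage_group:mapping has the value 1, carries the labels feature_category and stage_group, and has exactly one series per feature category. Those label names are assumptions for illustration, not confirmed here.

```promql
# Hypothetical expression for gitlab:component:stage_group:execution:error:rate_1h.
# Assumption: the mapping metric has value 1 and the labels
# feature_category and stage_group, one series per feature category.
sum without (feature_category) (
  # per feature category error rate (left side of the join)
  gitlab:component:feature_category:execution:error:rate_1h{component="puma"}
  # copy the stage_group label over from the mapping metric
  * on (feature_category) group_left (stage_group)
  gitlab:feature_category:stage_group:mapping
)
```

Using sum without (feature_category) rather than sum by (...) keeps every other label on the source series, which matches the intent of keeping the other labels. The same shape would apply to the ops, apdex success and apdex weight recordings.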

Error budget calculation for a stage group:

(
  # the number of operations with a satisfactory apdex
  sum_over_time(gitlab:component:stage_group:execution:apdex:success:rate_1h[30d])
  +
  ( 
    # the number of operations without errors
    sum_over_time(gitlab:component:stage_group:execution:ops:rate_1h[30d]) -
    sum_over_time(gitlab:component:stage_group:execution:error:rate_1h[30d]) 
  )
) / (
  # the total number of apdex measurements
  sum_over_time(gitlab:component:stage_group:execution:apdex:weight:score_1h[30d]) +
  # the total number of operations
  sum_over_time(gitlab:component:stage_group:execution:ops:rate_1h[30d])
)

For this we use the 1h recordings because those are the ones that were not upscaled, and we sum them over 30d to get a monthly budget.

The result is a ratio between 0 and 1 (multiply by 100 for a percentage): the share of operations (SLI measurements, not actual requests) that were satisfactory.

To turn this into a minute budget over the past 30d, we could show the following on a dashboard:

(1 - <ratio mentioned before>) * (30 * 24 * 60) # number of minutes in 30 days
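As a worked example of the conversion, assume a made-up ratio of 0.9995 (99.95% of SLI measurements were satisfactory). The number is purely illustrative and not taken from real data.

```latex
\begin{aligned}
\text{minutes in 30 days} &= 30 \times 24 \times 60 = 43200 \\
\text{unsatisfactory share} &= 1 - 0.9995 = 0.0005 \\
\text{minutes} &= 0.0005 \times 43200 = 21.6
\end{aligned}
```

So a ratio of 0.9995 would correspond to roughly 21.6 minutes out of the 30d window.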

Exit criteria

We have an error budget, expressed as a percentage, that covers all feature categories emitted from GitLab-rails for a stage group. The only requirement for the stage group is that metrics for their feature categories are emitted.

Edited by Bob Van Landuyt