Calculating error budget for a stage group for a single component
To get going on &437 we should make sure we can calculate the error budget for a single component (GitLab-rails) for a single stage group (doesn't really matter which one).
Required source metrics
- error rate over 1h per feature category: gitlab:component:feature_category:execution:error:rate_1h{component="puma"}
- total measurement rate for error rate over 1h per feature category: gitlab:component:feature_category:execution:ops:rate_1h{component="puma"}
- apdex success rate over 1h per feature category: gitlab:component:feature_category:execution:apdex:success:rate_1h{component="puma"} (gitlab-com/runbooks!3348 (merged))
- total apdex measurement rate over 1h per feature category: gitlab:component:feature_category:execution:apdex:weight:score_1h{component="puma"}
- gitlab:feature_category:stage_group:mapping to map these metrics to stage groups
Proposal
New recordings based on the above source metrics (nothing in place yet)
Each of the above gitlab:component:feature_category:* recordings gets an equivalent gitlab:component:stage_group:* recording that sums all the feature categories for that group but keeps the other labels.
| Feature Category Recording | Stage Group Recording |
|---|---|
| gitlab:component:feature_category:execution:error:rate_%s | gitlab:component:stage_group:execution:error:rate_%s |
| gitlab:component:feature_category:execution:ops:rate_%s | gitlab:component:stage_group:execution:ops:rate_%s |
| gitlab:component:feature_category:execution:apdex:success:rate_%s | gitlab:component:stage_group:execution:apdex:success:rate_%s |
| gitlab:component:feature_category:execution:apdex:weight:score_%s | gitlab:component:stage_group:execution:apdex:weight:score_%s |
These recordings could be part of the feature category aggregation set in the runbooks.
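As a minimal sketch, one of the stage group recordings could be expressed as a join against the mapping metric. It assumes that gitlab:feature_category:stage_group:mapping has the value 1, carries feature_category and stage_group labels, and has exactly one series per feature category. None of this is confirmed here, so the expression may need adjusting.

```promql
# Sketch: attach the stage_group label to each feature category series,
# then drop feature_category so that all categories of a group are summed
# while every other label is kept.
# Assumption: the mapping metric has value 1 and exactly one series per
# feature_category.
sum without (feature_category) (
  gitlab:component:feature_category:execution:error:rate_1h
  * on (feature_category) group_left (stage_group)
  gitlab:feature_category:stage_group:mapping
)
```

The same shape would apply to the ops, apdex success, and apdex weight recordings by swapping the source metric.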
Error budget calculation for a stage group:
(
# the number of operations with a satisfactory apdex
sum_over_time(gitlab:component:stage_group:execution:apdex:success:rate_1h[30d])
+
(
# the number of operations without errors
sum_over_time(gitlab:component:stage_group:execution:ops:rate_1h[30d]) -
sum_over_time(gitlab:component:stage_group:execution:error:rate_1h[30d])
)
) / (
# the total number of apdex measurements
sum_over_time(gitlab:component:stage_group:execution:apdex:weight:score_1h[30d]) +
# The total number of operations
sum_over_time(gitlab:component:stage_group:execution:ops:rate_1h[30d])
)
We use the 1h recordings because those are the ones that were not upscaled, and we sum them over 30d to get a monthly budget.
The result is a ratio between 0 and 1 (multiply by 100 for a percentage): the share of operations (SLI measurements, not actual requests) that were satisfactory.
To turn this into a budget in minutes over the past 30d, we could show this on a dashboard:
(1 - <ratio mentioned before>) * (30 * 24 * 60) # number of minutes in a month
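As a worked example, with a hypothetical ratio of 0.9995 (an illustrative value, not a measured one), the conversion comes out at roughly 21.6 minutes:

```promql
# Hypothetical ratio of 0.9995 over the past 30 days.
# 30 * 24 * 60 = 43200 minutes, and 0.0005 * 43200 is roughly 21.6.
(1 - 0.9995) * (30 * 24 * 60)
```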
Exit criteria
We have an error budget as a percentage for all feature categories emitted from GitLab-rails for a stage group. The only requirement for the stage group is that metrics for their feature categories are emitted.