Generates error ratios in a consistent manner
This is a small change extracted from !3068 (merged), to reduce the complexity of that MR.
Why?
GitLab's SLI monitoring tracks errors using error ratios, that is, "What is the ratio of requests in error to all requests?"
In different places, this is calculated differently.
In some places, we pre-aggregate the number of errors and the number of operations into recording rules and then divide one by the other, for example:
- record: gitlab_component_errors:rate
  expr: |
    sum by (env, environment, tier, type, stage, component) (
      gitlab_component_errors:rate{monitor!="global"}
    )
- record: gitlab_component_ops:rate
  expr: |
    sum by (env, environment, tier, type, stage, component) (
      gitlab_component_ops:rate{monitor!="global"}
    )
- record: gitlab_component_errors:ratio
  expr: |
    gitlab_component_errors:rate{monitor="global"}
    /
    gitlab_component_ops:rate{monitor="global"}
In other places, the recording rule is calculated directly, for example:
- record: gitlab_service_errors:ratio
  expr: |
    sum by (environment, env, tier, type, stage) (gitlab_component_errors:rate{monitor!="global"} >= 0)
    /
    sum by (environment, env, tier, type, stage) (gitlab_component_ops:rate{monitor!="global"} > 0)
This change makes all the error ratio recording rules consistent, and is yak shaving required for !3068 (merged).
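As a rough sketch of what a consistent form could look like, the service-level rule above might be split into the same pre-aggregated pattern used for the component rules. This assumes the pre-aggregated pattern is the one being standardised on, which the description does not state explicitly. The record names `gitlab_service_errors:rate` and `gitlab_service_ops:rate` are assumptions for illustration, as is the expectation that the aggregated series carry a `monitor="global"` label in the same way as the component-level series.

```yaml
# Hypothetical sketch only: the two :rate record names below are assumed,
# not taken from this MR.
- record: gitlab_service_errors:rate
  expr: |
    sum by (environment, env, tier, type, stage) (
      gitlab_component_errors:rate{monitor!="global"} >= 0
    )
- record: gitlab_service_ops:rate
  expr: |
    sum by (environment, env, tier, type, stage) (
      gitlab_component_ops:rate{monitor!="global"} > 0
    )
# The ratio is then a plain division of the two pre-aggregated series,
# assuming they are labelled monitor="global" like the component rules.
- record: gitlab_service_errors:ratio
  expr: |
    gitlab_service_errors:rate{monitor="global"}
    /
    gitlab_service_ops:rate{monitor="global"}
```

Written this way, the two intermediate `:rate` rules are additions and the existing `:ratio` rule keeps its name, which would fit the outcome of adding rules without removing any.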
Outcomes
A few extra recording rules are added, but none are removed, and the change has no functional differences from before.
Edited by Andrew Newdigate