
fix: clamp occurrence SLA ratio between 0 and 1

Bob Van Landuyt requested to merge bvl-clamp-occurence-sla into master

On 2023-01-09 we had some observability issues that caused some recordings to not complete. As a result, the success side (numerator) of the fraction could be higher than the operation side (denominator), which pushed availability above 100%.

The opposite could also happen for error SLIs when we miss recordings for the ops rate: the error ratio then exceeds 1 and availability drops below 0%.

This doesn't solve incorrect recordings, but it does make the numbers less ridiculous when they do happen, causing less confusion if they only happen for a brief period of time. I don't think this affects instances that collect all metrics in a single Prometheus instance.
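To illustrate the idea, here is a minimal Python sketch of the clamping, not the actual change in this MR (which is not shown here and presumably lives in the recording rules or dashboard queries). The function names `clamp_ratio` and `error_sli_availability` are hypothetical, and returning `None` for a zero ops rate is an assumption made only for this example.

```python
from typing import Optional


def clamp_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator limited to the range [0, 1].

    Hypothetical helper: models a success-based SLI where incomplete
    recordings can make the numerator larger than the denominator.
    """
    if denominator == 0:
        # Assumption for this sketch: no recorded operations means no ratio.
        return None
    return max(0.0, min(1.0, numerator / denominator))


def error_sli_availability(errors: float, ops: float) -> Optional[float]:
    """Return 1 - errors / ops limited to the range [0, 1].

    Hypothetical helper: models an error-based SLI where a missing ops-rate
    recording can make the error ratio exceed 1.
    """
    if ops == 0:
        return None
    return max(0.0, min(1.0, 1.0 - errors / ops))


# Success side recorded more than the operation side: capped at 100%.
over = clamp_ratio(120.0, 100.0)
# Errors recorded, but part of the ops rate is missing: floored at 0%.
under = error_sli_availability(120.0, 100.0)
# A healthy ratio passes through unchanged.
normal = clamp_ratio(50.0, 100.0)
```

As in the MR, the clamp only hides impossible values; the underlying recordings are still incorrect for that interval.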

Noticed when looking into https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/750 and collecting data for gitlab-com/gl-infra/scalability#1322 (closed).

Before: (image) https://dashboards.gitlab.net/d/general-occurence-slas/general-occurence-slas?orgId=1&from=1673229913042&to=1673350121738

After: (image) https://dashboards.gitlab.net/dashboard/snapshot/jI71ezHxybFbXpin1v8vuLQv6bzPXUr9?orgId=1&from=1673229913042&to=1673350121738
Edited by Bob Van Landuyt
