Don't include operation rates for SLIs without an error rate in error budgets for stage groups
From the conversation in #1404 (comment 804361493)
What happened
In gitlab-com/runbooks!4020 (merged) (part of &525 (closed)) we started feeding all SLIs defined in the service catalog into the error budget for stage groups. However, it went unnoticed that some of these SLIs have an operation rate without having an error rate. This meant that a few groups had skewed results on their dashboards, because `operations - missing_error_rate`, the success rate for the error component, would turn out to be 0 for that SLI (for example: #1399 (closed)).
In gitlab-com/runbooks!4096 (merged) I worked around that by replacing the error rate with `0 * operations` if the error rate was missing.
This was incorrect. Substituting 0 is how we work around missing source metrics, but missing recordings should not be handled that way. As a side effect, all of the traffic for these SLIs was treated as successful operations in the error budget. We don't know how this traffic has affected users, so we should not include it in the budget at all. As a result, for some groups (listed below), these metrics positively but incorrectly affected the availability.
One of the affected SLIs is `rails_requests`: this SLI only has an apdex and an operation rate. So the operation rate would incorrectly skew the total number for groups that have opted in.
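To illustrate the skew, here is a minimal sketch of the workaround's effect. It is not the actual recording rule or query: the counts, the `sli_with_errors` name, and the function are made up for illustration, and only the error component is modelled (apdex is left out).

```python
# Hypothetical per-SLI counts for one stage group.
# errors=None marks an SLI that has an operation rate but no error rate.
slis = {
    "sli_with_errors": {"operations": 1000, "errors": 10},
    "rails_requests": {"operations": 9000, "errors": None},
}


def availability_with_zero_errors(slis):
    """Workaround behaviour: a missing error rate counts as 0 errors,
    so all of that SLI's operations count as successes."""
    total_ops = sum(s["operations"] for s in slis.values())
    total_errors = sum(s["errors"] or 0 for s in slis.values())
    return (total_ops - total_errors) / total_ops


workaround = availability_with_zero_errors(slis)
print(f"availability with workaround: {workaround:.4%}")
```

In this made-up example the only SLI that actually measures errors is 99% successful, yet the group's number comes out higher because the 9000 unmeasured operations are all counted as successes.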
The solution
Instead of marking the "missing" errors as 0, we take out the operations. This means that this particular component no longer affects the error budget.
- gitlab-com/runbooks!4193 (merged) corrects the query for Sisense (available 1 day later, not retroactive), and the dashboards (available immediately)
- infradev-report!28 (merged) corrects the query for the infradev report (available on the current report after a few hours, not on retroactive reports).
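The idea behind the corrected queries can be sketched as follows. This is an illustration under assumptions, not the query from either merge request: the data, the `sli_with_errors` name, and the function are hypothetical, and only the error component is modelled.

```python
# Hypothetical per-SLI counts for one stage group.
# errors=None marks an SLI that has an operation rate but no error rate.
slis = {
    "sli_with_errors": {"operations": 1000, "errors": 10},
    "rails_requests": {"operations": 9000, "errors": None},
}


def availability_excluding_missing(slis):
    """Corrected behaviour: SLIs without an error rate contribute
    neither operations nor errors to the error component."""
    measured = [s for s in slis.values() if s["errors"] is not None]
    total_ops = sum(s["operations"] for s in measured)
    if total_ops == 0:
        return None  # nothing measurable, so no availability number
    total_errors = sum(s["errors"] for s in measured)
    return (total_ops - total_errors) / total_ops


corrected = availability_excluding_missing(slis)
print(f"availability after correction: {corrected:.4%}")
```

With the unmeasured operations removed, the result reflects only traffic whose errors are actually recorded, which is why the affected groups see a slightly lower number.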
The effect
No action is required. The groups listed here can expect their availability number to drop slightly. This shows the order of magnitude to expect, based on the numbers of 2022-01-01.
With these invalid operation rates taken out of the error budget calculation for stage groups, the numbers for 2022-01-01 00:00 would be:
Element | Value (current) | Value (corrected: without errors) | Difference | Affected components | EM |
---|---|---|---|---|---|
{stage_group="global_search"} | 99.9939% | 99.8813% | -0.1126% | elasticsearch_indexing, elasticsearch_searching | @changzhengliu |
{stage_group="not_owned"} | 99.7557% | 99.6742% | -0.0815% | rails_requests | |
{stage_group="package"} | 99.9799% | 99.9713% | -0.0087% | server_route_manifest_reads, server_route_manifest_writes | @michelletorres |
{stage_group="pipeline_authoring"} | 99.9232% | 99.8975% | -0.0257% | rails_requests | @marknuzzo |
{stage_group="pipeline_execution"} | 99.9920% | 99.9878% | -0.0042% | queuing_queries_duration, primary_server, secondary_servers | @avielle |
{stage_group="release"} | 99.9437% | 99.9348% | -0.0088% | server_headers | @nicolewilliams |
{stage_group="workspace"} | 99.9462% | 99.9421% | -0.0041% | imagescaler | @mksionek |