
fix: shard_import SLO

Steve Xuereb requested to merge fix/shard_imports-alert into master

What

Reduce the SLO for the shard_imports shard from 99.5% to 95%.

Why

In gitlab-com/gl-infra/production#8000 (closed) the on-call was paged because 4 imports failed. This shard has low traffic, so a small number of failed imports was enough to page the on-call; the alert recovered within minutes and wasn't very actionable.

From https://sre.google/workbook/alerting-on-slos/#low-traffic-services-and-error-budget-alerting:

The multiwindow, multi-burn-rate approach just detailed works well when a sufficiently high rate of incoming requests provides a meaningful signal when an issue arises. However, these approaches can cause problems for systems that receive a low rate of requests. If a system has either a low number of users or natural low-traffic periods (such as nights and weekends), you may need to alter your approach.

It’s harder to automatically distinguish unimportant events in low-traffic services. For example, if a system receives 10 requests per hour, then a single failed request results in an hourly error rate of 10%. For a 99.9% SLO, this request constitutes a 1,000x burn rate and would page immediately, as it consumed 13.9% of the 30-day error budget. This scenario allows for only seven failed requests in 30 days. Single requests can fail for a large number of ephemeral and uninteresting reasons that aren’t necessarily cost-effective to solve in the same way as large systematic outages.

The best solution depends on the nature of the service: what is the impact of a single failed request?
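To make the quoted arithmetic concrete, here is a small Python sketch using the workbook's hypothetical low-traffic service (10 requests per hour over a 30-day window); these numbers are illustrative, not actual shard_imports traffic:

```python
# Error-budget arithmetic for the SRE Workbook's low-traffic example.
# Hypothetical service: 10 requests/hour, 30-day SLO window
# (illustrative numbers, not real shard_imports traffic).
REQUESTS_PER_HOUR = 10
WINDOW_HOURS = 30 * 24

def error_budget_requests(slo: float) -> float:
    """Requests allowed to fail in the window without violating the SLO."""
    total_requests = REQUESTS_PER_HOUR * WINDOW_HOURS
    return (1 - slo) * total_requests

slo = 0.999
budget = error_budget_requests(slo)
print(f"30-day error budget at {slo:.1%}: {budget:.1f} requests")     # ~7.2 requests
print(f"One failed request consumes {1 / budget:.1%} of the budget")  # ~13.9%
```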

I don't think the impact of a single failed import is page-worthy.

Looking at the proposed dashboard analysis, if we reduce this to 95% it wouldn't have paged the on-call for small errors like these:

Screenshot_2022-11-05_at_09.01.55 (source)
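For comparison, here is a rough sketch of why the lower target stops paging on a handful of failures. It uses a simplified version of the long-window page condition from the workbook's multiwindow example (page when the last hour's burn rate exceeds 14.4x) and a hypothetical rate of 10 imports per hour; the actual alerting rules and shard traffic may differ:

```python
# Simplified "page" check based on the SRE Workbook's multi-burn-rate example:
# page when the error rate over the last hour exceeds 14.4x the error budget.
# The 14.4x factor and the 10 imports/hour rate are illustrative assumptions,
# not the exact alerting rules or traffic for the shard_imports shard.
BURN_RATE_FACTOR = 14.4
IMPORTS_PER_HOUR = 10

def would_page(failed_last_hour: int, slo: float) -> bool:
    error_rate = failed_last_hour / IMPORTS_PER_HOUR
    threshold = BURN_RATE_FACTOR * (1 - slo)  # error rate that triggers a page
    return error_rate > threshold

for slo in (0.995, 0.95):
    print(f"SLO {slo:.1%}: 1 failure pages? {would_page(1, slo)}, "
          f"4 failures page? {would_page(4, slo)}")
# SLO 99.5%: even one failed import (10% error rate) exceeds the 7.2% threshold.
# SLO 95.0%: the threshold is 72%, so a handful of failures no longer pages.
```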

Reference: gitlab-com/gl-infra/production#8000 (closed)
