fix: shard_import SLO
What
Reduce the SLO for the `shard_imports` shard from 99.5% to 95%.
Why
In gitlab-com/gl-infra/production#8000 (closed), the on-call was paged because 4 imports failed. This shard has low traffic, so a small number of failed imports was enough to page the on-call; the service recovered within minutes, and the page wasn't very actionable.
See https://sre.google/workbook/alerting-on-slos/#low-traffic-services-and-error-budget-alerting:

> The multiwindow, multi-burn-rate approach just detailed works well when a sufficiently high rate of incoming requests provides a meaningful signal when an issue arises. However, these approaches can cause problems for systems that receive a low rate of requests. If a system has either a low number of users or natural low-traffic periods (such as nights and weekends), you may need to alter your approach.
>
> It’s harder to automatically distinguish unimportant events in low-traffic services. For example, if a system receives 10 requests per hour, then a single failed request results in an hourly error rate of 10%. For a 99.9% SLO, this request constitutes a 1,000x burn rate and would page immediately, as it consumed 13.9% of the 30-day error budget. This scenario allows for only seven failed requests in 30 days. Single requests can fail for a large number of ephemeral and uninteresting reasons that aren’t necessarily cost-effective to solve in the same way as large systematic outages.
>
> The best solution depends on the nature of the service: what is the impact of a single failed request?
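To make the error-budget arithmetic from the quoted passage concrete, here is a small sketch. The traffic figure (10 requests/hour) is the workbook's illustrative example, not real `shard_imports` traffic; the comparison with a 95% SLO is an assumption added to show why loosening the target absorbs one-off failures on a low-traffic shard.

```python
# Error-budget arithmetic for a low-traffic service over a 30-day window.
# 10 req/hour is the SRE Workbook's example, not measured shard traffic.
requests_per_hour = 10
window_hours = 30 * 24
total_requests = requests_per_hour * window_hours  # 7,200 requests / 30 days


def error_budget(slo: float) -> float:
    """Allowed failed requests in the window for a given SLO."""
    return (1 - slo) * total_requests


# At a 99.9% SLO the budget is ~7 failed requests per 30 days,
# so a single failure consumes ~13.9% of the whole budget.
tight = error_budget(0.999)
print(f"99.9% SLO: budget {tight:.1f} failures, "
      f"one failure = {1 / tight:.1%} of budget")

# At a 95% SLO the same shard has a budget of 360 failures,
# so one failure consumes well under 1% of the budget.
loose = error_budget(0.95)
print(f"95.0% SLO: budget {loose:.1f} failures, "
      f"one failure = {1 / loose:.2%} of budget")
```

With the same traffic, the looser target turns a single failed import from a budget-draining event into background noise, which is the behaviour we want from the alerting on this shard.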
I don't think the impact of a single import failing is page-worthy.
Looking at the proposed dashboard analysis, if we reduce this to 95%, it wouldn't have paged the on-call for small errors like these:
Reference: gitlab-com/gl-infra/production#8000 (closed)