All errors: A new type of error SLI
The Problem
Sometimes a misconfiguration or deployment failure results in the GitLab application incorrectly attributing a problem to a client (4xx error) when in fact it is a server-side issue.
This could either be a permission error (401, 403) or a missing resource error (404).
However, when this happens we don't have a great monitoring story. Since we treat these as client errors, we effectively ignore the failures and rely on clients to tell us that the system has failed.
This means that an SLI could be returning 100% 4xx errors, with no successes at all, and no alert would fire.
For example, on Dedicated, a misconfiguration led to all HTTPS git push requests failing with a 403 until the problem was reported by the client. This should have been caught earlier (by the Dedicated operations team), but was not, because the failures were reported as client-side errors. No requests succeeded during this period.
More details in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/197
Monitoring Client Errors
In general, we don't include client-side (4xx) errors in our SLIs, and for good reason: a client sending invalid requests should not consume our error budget.
One workaround might be to introduce a new measurement, alongside the existing Error and Apdex SLIs. Naming is hard, but let's use "AllErrors" for now.
apdex: successCounterApdex(
  successRateMetric='gitlab_sli_rails_request_apdex_success_total',
  operationRateMetric='gitlab_sli_rails_request_apdex_total',
  selector=baseSelector,
),

requestRate: rateMetric(
  counter='http_requests_total',
  selector=baseSelector,
),

// Existing error SLI: server-side (5xx) errors only.
errorRate: rateMetric(
  counter='http_requests_total',
  selector=baseSelector { status: { re: '5..' } },
),

// New: counts both client (4xx) and server (5xx) errors.
allErrorRate: rateMetric(
  counter='http_requests_total',
  selector=baseSelector { status: { re: '5..|4..' } },
),
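For intuition, allErrorRate corresponds roughly to a rate over all requests that returned either a client (4xx) or server (5xx) status. A minimal sketch of the equivalent PromQL, rendered with plain jsonnet; the type="web" selector is a made-up stand-in for baseSelector, and the real metrics-catalog rendering may differ:

// Hypothetical sketch only: roughly what allErrorRate measures.
// The selector is a stand-in for baseSelector, not a real catalog value.
local selector = 'type="web"';

{
  allErrorRatePromQL:
    'sum(rate(http_requests_total{%s, status=~"5..|4.."}[5m]))' % selector,
}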
The change above adds allErrorRate, which includes both 4xx (client) and 5xx (server) errors.
This would be an additional SLI, alongside errorRate. It would not be aggregated, and would not contribute towards error budgets.
It would be evaluated for a very high failure rate only, e.g. 80% errors over a 1h/5m window. Additionally, there would be no 6h/30m window.
If a configuration problem occurred, and 80% of all requests for a specific SLI started failing, an alert would trigger.
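To make the idea concrete, the alert condition could look something like the sketch below. This is not the actual metrics-catalog or alerting API; the rule name, severity label, and type="web" selector are placeholders.

// Hypothetical sketch: a multi-window "all errors" alert at an 80% failure threshold.
local errorRatio(window) =
  'sum(rate(http_requests_total{type="web", status=~"5..|4.."}[%(w)s])) / sum(rate(http_requests_total{type="web"}[%(w)s]))'
  % { w: window };

{
  alert: 'RailsRequestAllErrorRateTooHigh',  // placeholder name
  // Fire only when the combined 4xx+5xx ratio exceeds 80% on both the
  // long (1h) and short (5m) windows, so brief blips do not page.
  expr: '(%s) > 0.8 and (%s) > 0.8' % [errorRatio('1h'), errorRatio('5m')],
  labels: { severity: 's3' },  // placeholder severity
}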
Since the SLOs are very low (e.g. a 20% success rate, or lower), some work would need to be done on calculating the appropriate thresholds.
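To illustrate why: the conventional multi-window burn-rate approach sets the alert threshold at burn rate × (1 − SLO). With an SLO as low as a 20% success rate, the error budget is 80%, so the usual 14.4× factor for the 1h/5m pair would put the threshold above 100% and the alert could never fire. A small sketch of that arithmetic (the 14.4 factor is the conventional SRE-workbook value, not something taken from our current configuration):

// Hypothetical sketch: why the conventional burn-rate threshold formula
// breaks down when the success-rate SLO is very low.
local slo = 0.20;              // e.g. a 20% success-rate target
local errorBudget = 1 - slo;   // 80% of requests are "allowed" to fail
local burnRate1h = 14.4;       // conventional factor for the 1h/5m window pair

{
  // threshold = burn rate * error budget; above 1.0 (100% errors) it can never fire.
  errorRateThreshold: burnRate1h * errorBudget,   // 11.52, i.e. well over 100%
  canEverFire: burnRate1h * errorBudget <= 1,     // false
}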