Don't record request duration for failing requests
Coming from gitlab-org/gitlab!62091 (comment 580655929) and the discussion in the team call:
We currently use two different metrics for measuring request apdex, which have slightly different implementations:
- Gitlab::Metrics::RequestMiddleware: Used for SLIs. This implementation tracks the duration of all requests that did not raise, so it includes "manually" rendered 5xx errors.

  The scoring for service availability looks like this:

  | | Fast | Slow |
  |---|---|---|
  | Success | 2/2 | 1/2 |
  | Handled error | 1/2 | 0/2 |
  | Unhandled server error | 0/1 | 0/1 |

- Gitlab::Metrics::Transaction: Used for stage group error budgets. This implementation tracks the duration of all requests, regardless of the status.

  The scoring for the error budget looks like this:

  | | Fast | Slow |
  |---|---|---|
  | Success | 2/2 | 1/2 |
  | Error | 1/2 | 0/2 |
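To make the two current scoring tables concrete, here is a minimal sketch of how one request could be scored. It assumes each cell reads as "good measurements / measurements taken", with one error-rate measurement per request plus one apdex measurement whenever a duration is recorded. The method name `current_score`, the `raised:` and `implementation:` parameters, and the 1 second threshold are all hypothetical and not part of GitLab.

```ruby
# Hypothetical sketch, not GitLab code. Returns [good, total] for one request.
# threshold_s is an assumed apdex threshold, chosen only for illustration.
def current_score(status:, duration_s:, raised: false,
                  implementation: :request_middleware, threshold_s: 1.0)
  failed = raised || status >= 500

  # The error-rate measurement is always taken.
  good = failed ? 0 : 1
  total = 1

  # RequestMiddleware skips the duration when the request raised;
  # Transaction records it regardless of the outcome.
  if implementation == :transaction || !raised
    total += 1
    good += 1 if duration_s <= threshold_s
  end

  [good, total]
end
```

Under these assumptions, a fast, manually rendered 500 scores `[1, 2]` with the middleware, while a request that raised scores `[0, 1]` because no duration is recorded for it.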
Proposal
In the short term (without changing these metrics), we do not want to take an apdex measurement for failing requests, where a failing request is anything resulting in a 5xx status code. This is the current situation. If we also stopped measuring durations for 4xx requests in this iteration, we'd be taking a bunch of very fast requests out of the apdex. This could trigger alerts, so we'd need to tread carefully. A better way to do this would be to introduce new metrics and switch our SLIs over to those. (#1099 (closed))
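The concern about 4xx requests can be illustrated with a small calculation. This sketch treats apdex as the share of measured requests that finished within the threshold, which is a simplification, and the traffic numbers are made up purely for illustration. The helper names `apdex` and `example_apdex_shift` are hypothetical.

```ruby
# Simplified apdex: fraction of measured requests that were fast.
def apdex(fast:, total:)
  fast.to_f / total
end

# Made-up traffic mix: 4xx responses are assumed to all be fast.
def example_apdex_shift
  fast_success = 630
  slow_success = 70
  fast_4xx = 300

  {
    with_4xx: apdex(fast: fast_success + fast_4xx,
                    total: fast_success + slow_success + fast_4xx),
    without_4xx: apdex(fast: fast_success,
                       total: fast_success + slow_success)
  }
end
```

With these numbers, the ratio drops from 0.93 to 0.90 once the 4xx durations are no longer measured, even though no request became slower. A drop like that could be enough to cross an alerting threshold.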
For both error budgets and availability, we therefore want to have the following scoring:
| | Fast | Slow |
|---|---|---|
| Success | 2/2 | 1/2 |
| Error | 0/1 | 0/1 |
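As a rough sketch of the proposed scoring, the following assumes each cell reads as "good measurements / measurements taken" and that an error is any response with a 5xx status. The method names `proposed_score` and `proposed_ratio`, and the 1 second threshold, are hypothetical and only serve the illustration.

```ruby
# Hypothetical sketch of the proposed scoring. Returns [good, total].
# threshold_s is an assumed apdex threshold, chosen only for illustration.
def proposed_score(status:, duration_s:, threshold_s: 1.0)
  # Failing requests get only the error-rate measurement, which they fail.
  return [0, 1] if status >= 500

  # Successful requests pass the error-rate measurement and also
  # get an apdex measurement based on their duration.
  duration_s <= threshold_s ? [2, 2] : [1, 2]
end

# Sums the per-request scores into one good/total ratio.
def proposed_ratio(requests)
  good, total = requests
    .map { |r| proposed_score(**r) }
    .reduce([0, 0]) { |(g, t), (rg, rt)| [g + rg, t + rt] }
  good.to_f / total
end
```

For example, one fast success, one slow success and one slow 500 would give 3 good measurements out of 5 under this scheme, and the duration of the failing request plays no part in the result.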