Increase the default latency threshold for the apdex portion of the Error Budget
Background
Error Budgets are made up of two components: apdex and error.
For the apdex portion, we use a latency threshold to determine whether a request is fast enough. The threshold is currently 1s.
Not all endpoints have the same latency requirements. We are building the ability for each endpoint to define its own threshold in project &525 (closed).
Because this flexibility does not exist yet, stage groups that are being introduced to Error Budgets have asked for a higher default latency threshold, so that endpoints that are permitted to be slower do not count against their budget spend. As improvements are made in each stage group, we can then lower the default latency threshold again.
Proposal
When we record latency measurements, we do not record the exact number of seconds that each request took. For monitoring performance and storage reasons, the duration is stored in buckets. The bucket boundaries, in seconds, are [0.1, 0.25, 0.5, 1.0, 2.5, 5.0] (found here). Any threshold we choose therefore has to be one of these boundaries.
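To illustrate why the threshold has to line up with a bucket boundary, here is a minimal sketch. It assumes cumulative buckets (each bucket counts every request at or below its upper bound), which is how Prometheus-style histograms behave. The function names `bucket_counts` and `fast_enough_ratio` and the sample durations are hypothetical and are not taken from our metrics code.

```python
from bisect import bisect_left

# Bucket upper bounds in seconds, as listed above.
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def bucket_counts(durations):
    """Return (cumulative counts per bucket, total number of requests).

    Assumption: buckets are cumulative, so a 0.3s request is counted in
    the 0.5, 1.0, 2.5 and 5.0 buckets. Requests slower than the largest
    bucket only show up in the total.
    """
    counts = [0] * len(BUCKETS)
    for duration in durations:
        first = bisect_left(BUCKETS, duration)  # first bound >= duration
        for i in range(first, len(BUCKETS)):
            counts[i] += 1
    return counts, len(durations)


def fast_enough_ratio(counts, total, threshold):
    """Fraction of requests that completed within `threshold` seconds.

    Only bucket boundaries are valid thresholds: the exact durations
    are gone, so a value such as 3s cannot be answered from the counts.
    """
    if threshold not in BUCKETS:
        raise ValueError(f"threshold must be one of {BUCKETS}")
    if total == 0:
        raise ValueError("no requests recorded")
    return counts[BUCKETS.index(threshold)] / total


# Hypothetical sample of request durations in seconds.
sample = [0.05, 0.3, 0.8, 0.9, 1.2, 2.0, 4.0, 6.0]
counts, total = bucket_counts(sample)
print(fast_enough_ratio(counts, total, 1.0))
print(fast_enough_ratio(counts, total, 5.0))
```

In this made-up sample, moving the threshold from 1s to 5s reclassifies the 1.2s, 2.0s and 4.0s requests as fast enough, while the 6.0s request still counts against the budget.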
We could use 2.5s or 5s as the default latency threshold. The table below demonstrates using 5s as the default. (We used a 7-day calculation to make the data gathering easier.)
Stage Group | Current availability (using 7 days) at 1s | Availability (using 7 days) at 5s | Improvement |
---|---|---|---|
source_code | 99.9060% | 99.9966% | 0.09% |
access | 99.9330% | 99.9743% | 0.04% |
code_review | 99.5516% | 99.9699% | 0.42% |
project_management | 99.0660% | 99.9421% | 0.88% |
global_search | 97.8226% | 99.3450% | 1.52% |
(A detailed view of this data can be found in issue #1244 (closed).)
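The Improvement column is the difference between the two availability figures, expressed in percentage points. The sketch below reproduces it from the table values; the variable and function names are illustrative only.

```python
# Availability per stage group over 7 days, copied from the table:
# (availability at 1s, availability at 5s), both in percent.
AVAILABILITY = {
    "source_code": (99.9060, 99.9966),
    "access": (99.9330, 99.9743),
    "code_review": (99.5516, 99.9699),
    "project_management": (99.0660, 99.9421),
    "global_search": (97.8226, 99.3450),
}


def improvement(at_1s, at_5s):
    """Difference in percentage points, rounded to two decimals."""
    return round(at_5s - at_1s, 2)


improvements = {
    group: improvement(at_1s, at_5s)
    for group, (at_1s, at_5s) in AVAILABILITY.items()
}

# List the groups that gain the most first.
for group, gain in sorted(improvements.items(), key=lambda kv: -kv[1]):
    print(f"{group}: +{gain:.2f} percentage points")
```

Read this way, the groups with the lowest availability at 1s (global_search and project_management) gain the most from the higher default.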
If this approach is chosen, we recommend using the 5s bucket for the largest impact.