Increase the default latency threshold for the apdex portion of the Error Budget

Background

Error Budgets are made up of two components: apdex and error.

For the apdex portion, we use a latency threshold to determine if a request is fast enough. This is currently 1s.
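To illustrate how the threshold decides whether a request counts as fast enough, here is a minimal sketch. It assumes raw per-request durations are available, which is a simplification for illustration only. The function name `apdex_ratio` and the sample durations are hypothetical and are not taken from the real implementation.

```python
def apdex_ratio(durations_s, threshold_s=1.0):
    """Return the fraction of requests that finished within the threshold.

    durations_s: request durations in seconds (hypothetical sample data).
    threshold_s: latency threshold in seconds; 1.0 mirrors the current default.
    """
    if not durations_s:
        raise ValueError("no requests measured")
    fast_enough = sum(1 for d in durations_s if d <= threshold_s)
    return fast_enough / len(durations_s)


# Made-up durations, purely for illustration.
sample = [0.2, 0.4, 0.9, 1.8, 4.0]

print(apdex_ratio(sample, threshold_s=1.0))  # 3 of the 5 requests qualify
print(apdex_ratio(sample, threshold_s=5.0))  # all 5 requests qualify
```

The same traffic scores differently depending only on the threshold, which is why the default matters for endpoints that are allowed to be slower.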

Not all endpoints have the same latency requirements. We are building the ability for each endpoint to define its own threshold in project &525 (closed).

Because this flexibility does not exist yet, there has been a request, raised while introducing Error Budgets to stage groups, to increase the default latency threshold so that endpoints that are permitted to be slower do not negatively affect budget spend. As improvements are made in each stage group, we can then lower the default latency threshold again.

Proposal

When we record latency measurements, we do not record the exact number of seconds that each request took. Durations are stored in buckets (for monitoring performance and storage reasons). The bucket boundaries, in seconds, are [0.1, 0.25, 0.5, 1.0, 2.5, 5.0] (found here). Any new default threshold therefore has to be one of these values.
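The following sketch shows why bucketed storage limits the choice of threshold. It assumes the buckets are cumulative upper bounds (each bucket counts every request at or below its boundary), as in Prometheus-style histograms; that storage model is an assumption here, not something stated above. The function `fraction_within` and the counts are hypothetical.

```python
# Bucket upper bounds in seconds, as listed in the text.
BUCKETS_S = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def fraction_within(cumulative_counts, total, threshold_s):
    """Fraction of requests at or below threshold_s.

    cumulative_counts[i] is assumed to be the number of requests whose
    duration was <= BUCKETS_S[i]. total is the count of all requests,
    including those slower than the largest bucket.
    """
    if threshold_s not in BUCKETS_S:
        # Exact durations are not stored, so only bucket boundaries can be used.
        raise ValueError(f"{threshold_s}s is not a bucket boundary")
    return cumulative_counts[BUCKETS_S.index(threshold_s)] / total


# Made-up counts for 1000 requests, purely for illustration.
counts = [500, 800, 900, 950, 990, 998]

print(fraction_within(counts, 1000, 1.0))  # uses the 1.0s bucket
print(fraction_within(counts, 1000, 5.0))  # uses the 5.0s bucket
```

A threshold such as 2s cannot be evaluated in this model, because no bucket boundary exists at that value.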

We could use 2.5s or 5s as the default latency threshold. The table below shows the effect of using 5s as the default. (We used a 7-day calculation to make data gathering easier.)

Stage Group	Current availability at 1s (7 days)	Availability at 5s (7 days)	Improvement (percentage points)
source_code	99.9060% 🔴	99.9966% ✅	0.09%
access	99.9330% 🔴	99.9743% ✅	0.04%
code_review	99.5516% 🔴	99.9699% ✅	0.42%
project_management	99.0660% 🔴	99.9421% 🔴	0.88%
global_search	97.8226% 🔴	99.3450% 🔴	1.52%

(A detailed view of this data can be found in issue #1244 (closed).)
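The Improvement column is the difference between the two availability figures, expressed in percentage points and rounded to two decimals. The sketch below reproduces it from the values in the table; the helper name `improvement_pp` is hypothetical, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Availability figures copied from the table: (at 1s, at 5s), in percent.
AVAILABILITY = {
    "source_code": ("99.9060", "99.9966"),
    "access": ("99.9330", "99.9743"),
    "code_review": ("99.5516", "99.9699"),
    "project_management": ("99.0660", "99.9421"),
    "global_search": ("97.8226", "99.3450"),
}


def improvement_pp(at_1s, at_5s):
    """Difference in percentage points, rounded to two decimal places."""
    return (Decimal(at_5s) - Decimal(at_1s)).quantize(Decimal("0.01"))


for group, (at_1s, at_5s) in AVAILABILITY.items():
    print(f"{group}: +{improvement_pp(at_1s, at_5s)} percentage points")
```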

If this approach is chosen, we recommend using the 5s bucket for the largest impact.

Edited Aug 20, 2021 by Rachel Nienaber