Error Budgets: Introducing the ability to set custom target durations for Web and API requests
Background
Every endpoint currently uses the same default Apdex threshold of 5 seconds. When a request takes longer than 5 seconds, it violates this threshold and counts against the Apdex portion of the Error Budget.
This mechanism does not account for the different performance characteristics of individual endpoints. Certain types of work are expected to take longer, and the first iteration of Error Budgets had no way to express that.
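As a rough illustration of the single-threshold behavior described above, here is a minimal Python sketch. It is not the actual Error Budget implementation: the function name, the constant, and the sample durations are made up for this example, and it assumes a request only counts against the budget when its duration exceeds the threshold.

```python
# Hypothetical constant: the single threshold applied to every endpoint.
DEFAULT_THRESHOLD_SECONDS = 5.0


def apdex_success_ratio(durations, threshold=DEFAULT_THRESHOLD_SECONDS):
    """Return the fraction of requests that did not exceed the threshold."""
    if not durations:
        # No traffic means nothing counted against the budget.
        return 1.0
    within = sum(1 for duration in durations if duration <= threshold)
    return within / len(durations)


# Made-up request durations in seconds; only the 7.3s request is a violation.
sample_durations = [0.12, 0.4, 0.9, 4.8, 5.0, 7.3]
ratio = apdex_success_ratio(sample_durations)
```

Because every endpoint shares the same 5-second limit, a 4.8-second response from an endpoint that users expect to be fast counts as a success in this calculation, which is the gap custom target durations are meant to close.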
Key Takeaways
Summary of the rest of the detail in this issue:
- Stage groups can now set a custom target duration for Web and API requests, and documentation is available on how to do this.
- The new target durations won't immediately be used for error budgets, but when &573 (closed) is done, groups will be able to control when they switch to the new value.
- All endpoints will continue to be measured against the 5-second threshold until the stage group chooses to switch to the new calculation method using custom target durations.
Custom Target Durations per Endpoint
In &572 (closed), we added the ability to define target durations for each Web and API endpoint. These custom target durations are not yet included in the Error Budget calculations. Stage groups are able to start setting target durations while we connect this feature up to the Error Budget calculations.
How to set target durations
Stage groups can follow the new developer documentation to set target durations and should ask for a review from a Scalability Team Member on the MR.
The developer documentation explains how to choose a target duration, how to set it, and what to consider when doing so.
What target duration options are available?
This table shows the duration options available. The Count and Count percentages columns show how many Web and API requests over the past 7 days fell within these ranges. (The data is taken from https://log.gprd.gitlab.net/goto/14f17a87e0e74094904733ff1064233c)
Target Duration Name (Urgency) | Duration (seconds) | Count | Count percentages | Cumulative | Notes |
---|---|---|---|---|---|
high | < 0.25 | 6,253,110,624 | 96.49% | 96.49% | |
medium | ≥ 0.25 and < 0.5 | 142,614,661 | 2.20% | 98.69% | |
default | ≥ 0.5 and < 1 | 61,343,281 | 0.95% | 99.64% | This will be the default if no target threshold is defined for an endpoint |
low | ≥ 1 and < 5 | 22,725,847 | 0.35% | 99.99% | |
Target durations should be chosen from the table above. We've added these categories so that it is easier to categorize endpoints and to collect information on which kinds of endpoints are handled by which fleet.
If no target duration is specified for an endpoint, the default target of 1s will be used. This should be sufficient for the majority of endpoints. The target duration of 5s (low) should be an exception. We're happy to take feedback on these categories.
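The categories and the fallback to the default can be pictured with a small Python sketch. This is only an illustration: the dictionary, the function names, and the endpoint names are hypothetical, and the real way to set a target duration is the one described in the developer documentation. The upper bounds are the ones from the table.

```python
# Upper bounds in seconds, taken from the table above. The names used here
# are hypothetical and only serve this illustration.
TARGET_DURATION_SECONDS = {
    "high": 0.25,
    "medium": 0.5,
    "default": 1.0,
    "low": 5.0,
}


def target_duration_for(endpoint, configured_urgencies):
    """Return the target in seconds, falling back to "default" when unset."""
    urgency = configured_urgencies.get(endpoint, "default")
    return TARGET_DURATION_SECONDS[urgency]


def bucket_for(duration_seconds):
    """Return the table row a duration falls into, or None at 5s and above."""
    # Dictionaries keep insertion order, so rows are checked from the
    # shortest upper bound to the longest.
    for urgency, upper_bound in TARGET_DURATION_SECONDS.items():
        if duration_seconds < upper_bound:
            return urgency
    return None


# Made-up configuration: only one endpoint has an explicit urgency.
configured = {"POST /imports": "low"}
import_target = target_duration_for("POST /imports", configured)
unset_target = target_duration_for("GET /projects", configured)
```

In this sketch `import_target` is 5.0 and `unset_target` is 1.0, mirroring the rule that endpoints without an explicit setting get the 1s default.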
What happens next after target durations are set?
When target durations are set, the system will start recording data for those targets as soon as the MR is deployed to production.
Error budgets will continue to use the current default of 5 seconds until the stage group chooses to move to the new target duration method.
When the next project &573 (closed) is completed, teams will be able to see their existing error budgets (using the 5s duration thresholds) as well as what their error budget will be when using target durations per endpoint.
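To make the difference between the two calculations concrete, here is a hedged Python sketch that scores the same requests twice: once against the fixed 5-second threshold and once against per-endpoint targets. The endpoint names, durations, and helper function are invented for this example, and it assumes a request meets its target when its duration does not exceed it. The actual calculation delivered by &573 may differ.

```python
# Target durations in seconds per urgency, as listed in the table above.
TARGET_DURATION_SECONDS = {"high": 0.25, "medium": 0.5, "default": 1.0, "low": 5.0}

# Hypothetical urgency settings; "GET /search" is deliberately left unset.
ENDPOINT_URGENCY = {"GET /projects": "high", "POST /imports": "low"}

# Made-up (endpoint, duration in seconds) samples.
REQUESTS = [
    ("GET /projects", 0.2),
    ("GET /projects", 1.4),
    ("POST /imports", 4.0),
    ("POST /imports", 6.0),
    ("GET /search", 0.8),
]


def success_ratio(requests, target_for):
    """Fraction of requests whose duration does not exceed their target."""
    if not requests:
        return 1.0
    met = sum(1 for endpoint, duration in requests if duration <= target_for(endpoint))
    return met / len(requests)


def fixed_target(endpoint):
    # Current method: every endpoint is measured against 5 seconds.
    return 5.0


def custom_target(endpoint):
    # New method: use the configured urgency, or "default" when unset.
    return TARGET_DURATION_SECONDS[ENDPOINT_URGENCY.get(endpoint, "default")]


current_ratio = success_ratio(REQUESTS, fixed_target)
custom_ratio = success_ratio(REQUESTS, custom_target)
```

With this sample data, four of five requests meet the fixed threshold, but only three of five meet their per-endpoint targets, because the 1.4-second request to the high-urgency endpoint now counts as a violation. Seeing both numbers side by side is what lets a group judge the impact before switching.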
Well, we could just set all of our target durations to 5 seconds and have a green error budget...
Yes, you could, and we aren't building a tool to automatically prevent this.
These thresholds should be adjusted according to user expectations, so that the resulting indicators tell us more accurately how users are experiencing the service. Team Members building the features are best placed to specify what an acceptable threshold is.
Error budgets are also designed to help make sure that the infrastructure serving the application has enough resources to adequately meet the needs of all stage groups. If all endpoints require a 5-second threshold, we will need to invest in more infrastructure resources to handle them, and may also need to do additional application and infrastructure work to scale to more parallel connections, jobs, and so on.
A request that takes 5 seconds holds onto resources that other requests need to use, so knowing how target durations are distributed helps us scale accordingly.
As part of &573 (closed) we will also deliver graphs that show the distribution of target durations. These can be used to make sure we do not have too many endpoints with the low (5s) target across the application.
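The kind of distribution those graphs would show can be approximated with a few lines of Python. This is a sketch under assumptions, not the tooling planned for &573: the endpoint names, their urgencies, and the function name are all hypothetical.

```python
from collections import Counter


def urgency_distribution(endpoint_urgencies):
    """Return the share of endpoints configured with each urgency."""
    counts = Counter(endpoint_urgencies.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {urgency: count / total for urgency, count in counts.items()}


# Made-up configuration for four endpoints.
endpoint_urgencies = {
    "GET /projects": "high",
    "GET /search": "default",
    "POST /imports": "low",
    "POST /exports": "low",
}
distribution = urgency_distribution(endpoint_urgencies)
low_share = distribution.get("low", 0.0)
```

Here half of the endpoints use the low urgency, which is the sort of signal that would prompt a closer look at whether the 5s target is really being treated as an exception.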
Next steps
Stage Groups
- Read the developer documentation for information on how to set custom target durations.
- Go through the endpoints for your feature categories, and determine which endpoints should have a target other than the new default of 1 second.
- For these endpoints, set a new target duration and include a Scalability team member on the MR.
- Wait for further instructions on how to switch your Error Budget to use these new targets.
Remember that the code to use these new target durations in Error Budgets is not yet complete, so your Error Budget will not be affected by these changes until &573 (closed) is completed and you have opted into the new calculation mechanism.
Scalability Team
- Continue to work on &573 (closed) to include target durations in the Error Budget calculations.
- Prepare communications for the Stage Groups on how to switch to the new Error Budget calculations.
- Help with MR reviews where new target durations are set.
Feedback and Questions
If you have any questions, concerns or feedback, please comment on this issue below.