Error Budgets: Using custom target durations for Web and API requests inside your Error Budget
Background
In #1315 (closed) we introduced the ability to set custom target durations on Web and API requests so that different types of endpoints could be held to different standards as the stage groups see fit.
In the spirit of iteration, this feature was released before the custom targets were incorporated into the Error Budgets themselves.
In this issue, we describe how all stage groups can now opt in to using these custom target durations.
How do we opt in?
The process is described in detail in this documentation: https://docs.gitlab.com/ee/development/application_slis/rails_request_apdex.html
The key step is to edit the teams.yml file and remove rails_requests from the ignored_components key of your stage group.
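As a rough illustration of that edit, a stage group entry in teams.yml might change as sketched below. Only the teams.yml file, the ignored_components key, and the rails_requests value come from this issue; the group name example_group, the other fields, and the graphql_queries entry are hypothetical placeholders, so check the linked documentation for the real layout.

```yaml
# Before: rails_requests is ignored, so the custom target durations
# are not used in this group's error budget.
# (example_group and graphql_queries are hypothetical placeholders.)
teams:
- name: example_group
  product_stage_group: example_group
  ignored_components:
  - graphql_queries
  - rails_requests

# After: rails_requests is removed from ignored_components, which opts
# the group in to the new apdex calculation.
# teams:
# - name: example_group
#   product_stage_group: example_group
#   ignored_components:
#   - graphql_queries
```

If rails_requests were the only entry in the list, the whole ignored_components key could presumably be dropped, but that is an assumption rather than something stated here.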
How do we track who has opted in?
There is a new dashboard to view this detail: https://dashboards.gitlab.net/d/general-request-apdex-sli-adoption/general-request-apdex-participation?orgId=1
This dashboard can be filtered by Product Stage and by Stage Group.
In the table, there is a comparison of old and new apdex ratios to show how each stage group will be affected by opting into the new calculations. Error rates are not included in this table: that component stays the same and is not affected by opting into the new apdex in the error budget.
The blue boxes at the top show the percentage of traffic and the number of endpoints that are set to use each of the urgency targets.
Well, we can still just set all of our target durations to 5 seconds and have a green error budget...
Yes, you can. Teams should opt into the new methods, even if they maintain their current 5-second apdex targets.
We need teams to opt in to the new calculation methods by the end of FY23Q1 so that we can improve how we monitor these endpoints. At the moment, we collect more data than we need, and we want to optimize this for the health of the monitoring stack. Opting into the new apdex calculations also gives us more interesting and helpful ways of displaying this data, some of which are described here.
Once teams are using the new calculation method, we are happy to help with any effort to increase the urgency of an endpoint. If a long duration on a specific endpoint would negatively affect the user experience, but a lower urgency had to be set for the endpoint to meet its target duration, we can work out a plan together to improve performance and raise the assigned urgency.
Next steps
- If your stage group hasn't already, set up custom targets for your Web and API endpoints
- Opt into using the new calculations by editing the teams.yml file
- Please note: the table in the dashboard above that compares the ratios uses a 7d range by default. This can be changed using Grafana's time picker. The top rows listing endpoints always look at the last 6h.
Feedback and Questions
If you have any feedback or questions, please comment on this issue.