FY22-Q2: Establish Error Budgets as group level PIs for all Ops groups
Description
Establish sensing mechanisms for SaaS availability within Ops product groups.
Goals:
- Add appropriate Error Budget dashboards to team PIs (in handbook page).
- Validate dashboards are configured to provide a meaningful signal to the group.
- Document in handbook how Error Budgets are used in team process for prioritization, etc.
More context in handbook at: https://about.gitlab.com/handbook/engineering/development/ops/#error-budgets
More Info
In FY22-Q2 we are adopting Error Budgets as Performance Indicators for stage groups.
A key tool is the stage group dashboard, for example the Package group's dashboard: https://dashboards.gitlab.net/d/stage-groups-package/stage-groups-group-dashboard-package-package?orgId=1. When looking at the dashboard, a team member can quickly establish whether the error budget is being exceeded. As described in the documentation, our error budget is made up of our Apdex performance as well as our error rate.
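To make the "Apdex plus error rate" composition concrete, here is a rough sketch of how the two signals could be combined into a single availability number and compared against a target. The formula, the field names, and the 99.95% target are assumptions for illustration only; the stage group dashboard and the handbook documentation remain the source of truth for how the budget is actually calculated.

```python
from dataclasses import dataclass


@dataclass
class BudgetInputs:
    """Hypothetical counters for one group over the budget window."""
    apdex_satisfied: int  # requests that met the latency threshold
    apdex_total: int      # requests measured for Apdex
    errors: int           # requests that failed
    requests_total: int   # requests measured for error rate


def availability(m: BudgetInputs) -> float:
    """Share of 'good' operations across both signals (assumed formula)."""
    total = m.apdex_total + m.requests_total
    if total == 0:
        return 1.0  # no traffic, so no budget spent
    good = m.apdex_satisfied + (m.requests_total - m.errors)
    return good / total


def budget_exceeded(m: BudgetInputs, target: float = 0.9995) -> bool:
    """True when availability falls below the target (target is an assumption)."""
    return availability(m) < target
```

For example, `budget_exceeded(BudgetInputs(9990, 10000, 5, 10000))` evaluates the combined ratio 19985 / 20000 against the assumed target.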
When evaluating the budget spent, the team's dashboard shows performance information and the endpoints that contribute to it. For error rate, it's helpful to review the logs in Kibana for the team's feature categories, filtered to HTTP response codes in the 500s. Here's an example for the Package group: https://log.gprd.gitlab.net/goto/c41e9f0228c80bd4d3f3104d4a8895ee
When evaluating what's contributing to a group's error budget, the dashboard can be used to determine whether there are any slow-running queries, whereas the logs can be used to determine which errors are actually occurring in production.
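As an illustration of the log review step, the sketch below tallies server errors per endpoint from a list of exported log records, so the most frequent offenders surface first. The record shape and the `endpoint` and `status` field names are hypothetical; the fields in the real Kibana index may be named differently.

```python
from collections import Counter


def top_server_errors(records, limit=5):
    """Count 5xx responses per (endpoint, status), most frequent first.

    `records` is an iterable of dicts with hypothetical keys
    'endpoint' (str) and 'status' (int).
    """
    counts = Counter(
        (r["endpoint"], r["status"])
        for r in records
        if 500 <= r["status"] <= 599  # keep only HTTP 5xx responses
    )
    return counts.most_common(limit)
```

Running this over a sample exported for a group's feature categories gives a short list to start investigating from, complementing the latency view on the dashboard.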
DRIs
Table of DRIs driving this for each product group
DRIs are listed in the Key Results in Ally https://app.ally.io/objectives/1335502?time_period_id=135090