Error Budgets as a Performance Indicator
In this issue, I'd like to discuss the idea of using Error Budgets as a performance indicator for the Scalability Group.
Background
An Error Budget is a single figure that we present to each Stage Group showing how reliable and performant their feature categories have been over the past 28 days. The target is 99.95%. Error Budgets are made up of latency measurements (apdex) and error counts. They were successfully introduced in 2021 and are part of the planning process for the Development teams.
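As a rough sketch of what that target means in practice, here is the arithmetic for turning a 99.95% target over a 28-day window into an allowed "spend". The numbers are straightforward to derive; this is not the production implementation, just the underlying math.

```python
# How a 99.95% target over a 28-day window translates into an
# allowed amount of unreliable time ("spend").
WINDOW_DAYS = 28
TARGET = 0.9995

window_minutes = WINDOW_DAYS * 24 * 60          # 40,320 minutes in the window
allowed_spend = window_minutes * (1 - TARGET)   # minutes of budget available

print(f"{allowed_spend:.2f} minutes of error budget per 28-day window")
# -> 20.16 minutes of error budget per 28-day window
```

In other words, a group meeting the target has roughly 20 minutes of budget to spend across the whole window.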
They were also a key contributor to improving the overall reliability of the system at a time when reliability had been poor. They are an important part of our engineering processes and we need to make sure that we continue to improve them so that they remain relevant and useful for the teams.
Why a Performance Indicator?
Having a performance indicator will help drive Error Budgets forward. We can use it to set SMART OKRs for improving Error Budgets, and to keep Error Budgets up-to-date and relevant in the organization.
What would we measure and what would the targets be?
The Error Budget figure won't work in isolation; it needs to be paired with a measure of quality, so I'm proposing two indicators.
Error Budget Completeness (or Maturity, or Accuracy)
We know that Error Budgets cover a certain set of data sources, such as Rails endpoints and Sidekiq errors. We also know there is a list of items we still need to add: GraphQL, Puma improvements, and a way to handle database problems, to name a few. So if there are 10 possible data sources for Error Budgets and we only include 6 of them at the moment, then Error Budgets are 60% complete. If we find two more sources to add, then we are at 6/12 and our completeness drops to 50%.
The target would be 100%.
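The calculation above can be sketched in a few lines. The item names here are illustrative placeholders, not an authoritative inventory of our data sources:

```python
# Completeness = items covered / (items covered + known missing items).
covered = ["rails_endpoints", "sidekiq_errors", "web_apdex",
           "api_errors", "git_apdex", "registry_errors"]   # 6 covered sources
missing = ["graphql", "puma", "database", "other"]         # 4 known gaps

def completeness(covered, missing):
    return len(covered) / (len(covered) + len(missing))

print(f"{completeness(covered, missing):.0%}")  # 6/10 -> 60%

# Discovering two more gaps lowers completeness even though
# actual coverage hasn't changed:
missing += ["gap_a", "gap_b"]
print(f"{completeness(covered, missing):.0%}")  # 6/12 -> 50%
```

One consequence worth noting: the indicator can move down simply because we learned more about what's missing, which is arguably the behavior we want.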
Error Budget Availability (Spend isn't the right word...)
This would be an aggregate availability figure across all stage groups. I'm not sure whether it should be a simple average across all stage groups or weighted in some way based on each group's share of traffic.
The target would be 99.95%.
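To make the two aggregation options concrete, here is a sketch comparing a simple mean with a traffic-weighted mean. The group names, availability figures, and traffic shares are made up for illustration:

```python
# Each entry: group -> (28-day availability, share of total traffic).
# Traffic shares are assumed to sum to 1.0.
budgets = {
    "group_a": (0.9990, 0.50),
    "group_b": (0.9997, 0.30),
    "group_c": (0.9999, 0.20),
}

simple = sum(avail for avail, _ in budgets.values()) / len(budgets)
weighted = sum(avail * share for avail, share in budgets.values())

print(f"simple:   {simple:.4%}")
print(f"weighted: {weighted:.4%}")
```

The weighted version pulls the aggregate toward the busiest groups, so a low-traffic group with a bad month moves the indicator less; which behavior we want is exactly the open question.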
How would these indicators work?
Let's look at some situations:
- We want to increase completeness of the Error Budgets, so we make it possible for GraphQL endpoints to be included. As soon as we do this, Error Budget spend increases because more of the system is being measured. We then need to work with the stage groups to get the spend back on target.
- The Error Budget Availability is above 99.95%, which means that teams aren't spending enough of their budget. We work with the teams to tighten up their SLOs. Tightening up SLOs could also be part of the completeness indicator.
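The first scenario above can be illustrated with a toy example. The operation counts are invented, but they show why adding coverage increases apparent spend: the failing operations existed before, they just weren't counted.

```python
# Before: only the existing sources are measured.
measured = {"success": 99_960, "failure": 40}    # 99.96%, above target

# Newly included GraphQL operations, previously invisible to the budget.
graphql = {"success": 9_970, "failure": 30}

def availability(ops):
    total = ops["success"] + ops["failure"]
    return ops["success"] / total

before = availability(measured)
after = availability({
    "success": measured["success"] + graphql["success"],
    "failure": measured["failure"] + graphql["failure"],
})

print(f"before GraphQL inclusion: {before:.4%}")  # above 99.95%
print(f"after GraphQL inclusion:  {after:.4%}")   # below 99.95%
```

Nothing got less reliable when the indicator dropped; the measurement got more honest, which is why the follow-up work with stage groups is part of the process rather than a failure of it.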
This proposal is still rough and I welcome any feedback.