Proposal of Red/Green Error Budgets

Background

We are struggling with how to correctly weight the database SLI information into the existing Error Budgets. The operation count for the database component is an order of magnitude larger than that of the existing components and, as a result, the database component drowns out the others.
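To illustrate the drowning-out effect, here is a minimal sketch that assumes the budget is a simple operation-weighted ratio of good operations to total operations. The component names and counts are invented for illustration and the real calculation may differ.

```python
# Hypothetical figures: the database handles 10x the operations of the
# other component, mirroring the order-of-magnitude gap described above.
TARGET = 0.9995  # 99.95% availability target

# component name -> (good operations, total operations)
components = {
    "web": (997_000, 1_000_000),          # 99.7%, well below target
    "database": (9_999_000, 10_000_000),  # 99.99%, above target
}


def availability(good, total):
    """Fraction of operations that met the SLI."""
    return good / total


def combined_availability(components):
    """Operation-weighted availability across all components."""
    good = sum(g for g, _ in components.values())
    total = sum(t for _, t in components.values())
    return good / total


for name, (good, total) in components.items():
    print(name, f"{availability(good, total):.4%}")
print("combined", f"{combined_availability(components):.4%}")
```

With these assumed numbers the combined figure stays above the 99.95% target even though the smaller component is clearly below it, because the database contributes roughly ten times as many operations.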

One option is to hard-code weights for each component. If we did this, we would be renegotiating the weights all the time to try to tweak the outcome to be what we want. Also, because different stage groups place different demands on services, the weighting would need to be set per stage group.

The final piece of context is that stage groups currently use the target availability (99.95%) as the call to action. If they are below the target, they are required to act. In practice, however, the data from Error Budgets is not always used for prioritization. One theory about this behaviour is that because the Error Budget is made up of so many components, and because there is often no single obvious outlier, it becomes hard to prioritize many different pieces of work across many different components.

Proposal

Move away from using the language of "Availability Target" and simplify the output to "Red" or "Green". Either the target is achieved or it is not, and the value no longer matters.
For each stage group, list the components that they use and whether they are Red or Green for each component.
If they are Red for any component, then the overall indicator is Red. If all components are Green, then the overall indicator is Green.
A box goes Red if the stage group's apdex for that service is lower than the SLO target for that service.
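The rules above can be sketched in a few lines. This is only an illustration of the proposed logic: the `ComponentStatus` type, its field names, and the example values are assumptions, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentStatus:
    """One box in a stage group's view (hypothetical structure)."""
    name: str
    apdex: float       # the stage group's apdex for this service
    slo_target: float  # the SLO target for this service

    @property
    def colour(self) -> str:
        # A box goes Red if apdex is lower than the SLO target.
        return "Red" if self.apdex < self.slo_target else "Green"


def overall_indicator(components) -> str:
    """Red if any component is Red, otherwise Green."""
    if any(c.colour == "Red" for c in components):
        return "Red"
    return "Green"


# Example values are invented for illustration.
boxes = [
    ComponentStatus("api", apdex=0.9991, slo_target=0.9950),
    ComponentStatus("database", apdex=0.9940, slo_target=0.9950),
]
for box in boxes:
    print(box.name, box.colour)
print("overall", overall_indicator(boxes))
```

Note that the magnitude of the shortfall is deliberately discarded: a component that misses its target by a small margin is Red in the same way as one that misses it by a large margin.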

We could then use this system as a single pane of glass for viewing operational requests for the stage groups. For example, we could have a component that is "Infradev issues past due" as shown below.

This becomes a clear indicator for stage groups that action is needed.

Services

The opposite side of this proposal is that we could provide a similar view for service owners, where Red stage groups are listed. Having this information means that service owners know who to connect with to find out more or to offer assistance to resolve availability/reliability/performance concerns.

Other

Terminology

In the description here I have talked about the Error Budget being made up of components. But we also talk of services being made up of components.

The terminology of components making up services is widely used at GitLab so we'll need to find another word to mean "things that make up an Error Budget".

How do other companies solve this problem?

We aren't the only company that uses Error Budgets. However, after reading through quite a few articles, I found that all of the examples look at error budgets from a service perspective. They roll up SLOs to the service level, and then communicate the health of the service. Users (developers) are shown the set of services that they rely on.

Error Budgets also seem to be focused on wholly-owned components, that is, services or API endpoints that have a single owner. I can't find examples of companies that split ownership.

I can say, from all the reading, that once we complete our next round of Error Budget work we should blog about it. Most of these articles are shallow and don't show practical examples beyond how to calculate a basic budget.

Where would this view go?

We could put it in the Data Index: https://gitlab-com.gitlab.io/gl-infra/platform/stage-groups-index/

What happens to the Error Budget Report?

We should continue to issue a report monthly to show people where to focus. Instead of listing all teams, we should only list those with a high traffic share that are Red. We can also include a new section to show teams that have been Red for three months or more. The report cadence remains the same, but the information overhead is lower.
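As a rough sketch of how the two report sections could be selected: the function names, the `min_share` threshold, and the group names below are all hypothetical, and "three months or more" is assumed here to mean consecutive months ending with the current one.

```python
def red_streak(statuses):
    """Count consecutive "Red" months at the end of a history (oldest first)."""
    streak = 0
    for status in reversed(statuses):
        if status != "Red":
            break
        streak += 1
    return streak


def report_sections(history, traffic_share, min_share=0.05, min_streak=3):
    """Pick the groups to list in each section of the monthly report.

    history:       group -> list of monthly "Red"/"Green" values, oldest first
    traffic_share: group -> fraction of total traffic (0.0 to 1.0)
    min_share:     assumed cut-off for "high traffic share" (not from the proposal)
    """
    currently_red = sorted(g for g, s in history.items() if red_streak(s) > 0)
    return {
        "high_traffic_red": [
            g for g in currently_red if traffic_share.get(g, 0.0) >= min_share
        ],
        "long_running_red": [
            g for g in currently_red if red_streak(history[g]) >= min_streak
        ],
    }


# Invented example data.
history = {
    "group_a": ["Green", "Red", "Red", "Red"],
    "group_b": ["Red", "Green", "Red"],
    "group_c": ["Green", "Green", "Green"],
    "group_d": ["Red", "Red", "Red", "Red"],
}
traffic_share = {"group_a": 0.20, "group_b": 0.01, "group_c": 0.30, "group_d": 0.02}

sections = report_sections(history, traffic_share)
print(sections)
```

The cut-off for "high traffic share" would need to be agreed separately; it is a parameter here precisely because the proposal does not define it.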

How do we make sure that Stage Groups are prioritizing improvements?

At the moment we don't correctly look for trends where teams are repeatedly over budget. If teams are repeatedly over budget, we should look to introduce an FCL for that team.

This change could be made with the existing system.

Edited Jun 28, 2023 by Rachel Nienaber