Allow stage groups to see how their error budget is being spent

In &437 we're adding an error budget metric to the stage group dashboard.

Currently, we don't provide a way to see where that spend is coming from. While handling the review requests, we spent most of our time looking into the components with the highest failure rate per violation type (a violation type being either errors or apdex violations). I think we should show this information in a new row, collapsed by default, on the stage group dashboard.

This graph shows that information for all components and violation types. The combination with the highest value on that graph would be the most useful one to fix.

Next to that graph should be the explanation of the most common components, and what "violation types" mean:

| Component | Violation | Explanation |
| --- | --- | --- |
| puma | errors | Responses with status code 5xx |
| puma | apdex | Responses that took longer than the threshold to complete (1 second) |
| sidekiq_execution | errors | Sidekiq job failures |
| sidekiq_execution | apdex | Sidekiq jobs that took longer than the threshold to complete |

Next to that, we could add links to Kibana that show the violating operations per component. I'm hoping we can build these using toolingLinks: #1071 (closed).

Old explanation

We could use the `gitlab:component:feature_category:execution:` metrics, which still have the `feature_category`, `type` and `component` labels; these should help stage groups determine where the budget is being spent. But how should we display that on the dashboards?

In the investigations (for example #1062 (closed)) I've done things like this in Thanos. It splits both the apdex and error rates out of the budget into separate graphs and shows the success ratio for each.

Since this hides the total operation rate, we should still show a panel that shows the overall operation rate for the group, along these lines.

That way, the reader can weigh the importance of a low success ratio of a component.
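To illustrate why the operation rate matters next to the success ratio, here is a minimal Python sketch. It is not the dashboard query: the `ComponentSample` type, the helper names and all numbers are hypothetical and only stand in for the values the panels would show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSample:
    """Hypothetical snapshot of one component and violation type."""
    component: str         # e.g. "puma" or "sidekiq_execution"
    violation: str         # "errors" or "apdex"
    ops_per_second: float  # overall operation rate (made-up numbers below)
    success_ratio: float   # fraction of operations without a violation


def violations_per_second(sample: ComponentSample) -> float:
    """Absolute violation rate: operation rate times the failing fraction."""
    return sample.ops_per_second * (1.0 - sample.success_ratio)


def rank_by_spend(samples):
    """Return samples sorted so the largest absolute violation rate is first."""
    return sorted(samples, key=violations_per_second, reverse=True)


# Made-up values: sidekiq_execution has the lowest success ratio,
# but it also handles far fewer operations than puma.
samples = [
    ComponentSample("puma", "errors", ops_per_second=200.0, success_ratio=0.999),
    ComponentSample("puma", "apdex", ops_per_second=200.0, success_ratio=0.98),
    ComponentSample("sidekiq_execution", "errors", ops_per_second=1.0, success_ratio=0.90),
]

for s in rank_by_spend(samples):
    print(f"{s.component:18} {s.violation:7} {violations_per_second(s):.2f} violations/s")
```

With these made-up numbers, the component with the lowest success ratio ends up last in the ranking because its operation rate is so small. That is the kind of weighing the overall operation rate panel is meant to support.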

When we've added these graphs, we should document them in the developer documentation for the stage group dashboards.

Outcome

Once we have updated the dashboards to provide an easier way to interpret the error budget spend, we should use them to generate two requests for each of the stage groups with the largest spend.

Edited Jun 04, 2021 by Rachel Nienaber