This change was extracted from !3068 (merged), in order to simplify that MR and break it up.

Some History

A very long time ago (probably about a year ago!), before we used multi-window, multi-burn-rate alerting, we based all our alerting and SLA monitoring on a single 1m burn rate calculation.

Most of our monitoring was done at the service level, including apdex monitoring.

As a safety mechanism, when aggregating multiple metrics into the service-level apdex score, we used an average weighted by RPS measured over a longer period (originally 1h).

This was done so that if a single component of the apdex dropped to 0 RPS, it would gradually fade out of the weighted average rather than disappearing instantaneously.
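
To illustrate the fade-out, here is a minimal Python sketch of an RPS-weighted apdex average. It is not the actual recording rule: the function names, the per-minute sample lists, and the numbers are all hypothetical, chosen only to show how a 1h weight decays gradually once a component stops receiving traffic.

```python
def long_window_weight(rps_per_minute, window=60):
    """Mean RPS over the trailing `window` per-minute samples (hypothetical helper)."""
    recent = rps_per_minute[-window:]
    return sum(recent) / len(recent)


def weighted_apdex(components):
    """Average apdex over (apdex_score, weight) pairs, weighted by `weight`."""
    total_weight = sum(weight for _, weight in components)
    if total_weight == 0:
        return None  # no traffic at all, so no meaningful score
    return sum(score * weight for score, weight in components) / total_weight


# Illustrative data: component A is steady, component B dropped to 0 RPS
# fifteen minutes ago.
rps_a = [100.0] * 60
rps_b = [100.0] * 45 + [0.0] * 15

weight_a = long_window_weight(rps_a)
weight_b = long_window_weight(rps_b)  # still three quarters of its old weight

service_apdex = weighted_apdex([(0.99, weight_a), (0.90, weight_b)])
print(weight_a, weight_b, round(service_apdex, 4))
```

With a 1m weight, B would vanish from the average the moment its traffic stopped; with the 1h weight it keeps contributing, with steadily shrinking influence, for up to an hour.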

The problem was that when we removed an apdex measurement, the aggregated apdex would drop until the weighted average had adjusted. This had a negative impact on our SLA calculation.

To fix this, we added a mechanism to filter out apdex scores that had been removed. This is the "and on gitlab_component_mapping" clause.
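
The effect of that filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the real query: it assumes that a removed component stops reporting an apdex score while its long-window RPS weight lingers, and it models the component mapping as a plain set. All names and numbers are hypothetical.

```python
def aggregate_apdex(scores, weights, known_components=None):
    """RPS-weighted apdex across components.

    scores:  component -> current apdex (absent once a component is removed)
    weights: component -> long-window RPS weight (assumed to linger after removal)
    known_components: optional set standing in for the component mapping;
        when given, anything outside it is excluded from the average.
    """
    names = list(weights)
    if known_components is not None:
        names = [name for name in names if name in known_components]
    total_weight = sum(weights[name] for name in names)
    if total_weight == 0:
        return None
    # A component with a lingering weight but no score contributes nothing
    # to the numerator, which drags the aggregate down.
    weighted_sum = sum(scores.get(name, 0.0) * weights[name] for name in names)
    return weighted_sum / total_weight


scores = {"api": 0.99}                      # "legacy" was removed, no score
weights = {"api": 100.0, "legacy": 50.0}    # but its 1h weight has not decayed yet

unfiltered = aggregate_apdex(scores, weights)
filtered = aggregate_apdex(scores, weights, known_components={"api"})
print(round(unfiltered, 4), round(filtered, 4))
```

Without the filter, the stale weight pulls the aggregate well below the score of the only component still reporting; with it, the removed component is excluded immediately.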

Since then, we have moved on a great deal.

Monitoring no longer uses a 1m burn rate.
Monitoring now uses multi-window, multi-burn-rate (MWMBR) alerts and is done at the SLI level for the most part.
Service-level aggregations are no longer used for monitoring, only for SLA calculation, so the slowly moving weighted averages no longer make sense.

So where we have ended up is with a great deal of technical debt, in the form of complexity and surprising behaviour. None of it is useful any longer, and as we refactor the way we calculate aggregations in !3106 (merged), this surprising behaviour is difficult to bring along.

This MR removes it.

What outcomes should we expect?

Alerting and monitoring will not be impacted as we no longer use the 1m burn rate.
SLA monitoring should not be impacted and, if anything, will be slightly more accurate.
Going forward, !3106 (merged) will allow us to move SLA calculation over to 5m range queries and totally drop the 1m burn rate queries.

Edited Jan 14, 2021 by Andrew Newdigate

Simplify service aggregated apdex calculation
