feat: Add recording rule duration SLI
feat: add errorRateApdex
An errorRateApdex allows us to define SLIs using an error counter and a total counter. It translates into an SLI similar to our other apdexes (histogram & success-rate), which have an ideal ratio of 100%.
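As a rough illustration of the arithmetic only (not the jsonnet implementation in this MR), an error-rate apdex can be read as the share of operations that did not error. The function name and the choice to return `None` when there is no traffic are assumptions made for this sketch:

```python
from typing import Optional


def error_rate_apdex(error_count: float, total_count: float) -> Optional[float]:
    """Return the fraction of operations that did not error, in [0, 1].

    Hypothetical helper for illustration: an apdex derived from an error
    counter and a total counter, where 1.0 (100%) is the ideal value.
    """
    if total_count <= 0:
        # No operations observed: the ratio is undefined (assumption: report None).
        return None
    return (total_count - error_count) / total_count


if __name__ == "__main__":
    # 5 errors out of 200 operations.
    print(error_rate_apdex(5, 200))
```

With zero errors the result is 1.0, which matches the "ideal ratio of 100%" shared with the other apdex types.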
feat: Add recording rule duration SLI
This adds a new SLI to both Thanos and Monitoring (for Prometheus). The SLI keeps an eye on rule-group durations: every evaluation whose duration exceeds its interval is counted as an error, meaning we need to work on improving the duration of that rule group.
We're recording this as an apdex because this SLI measures the latency (duration) of a recording rule group.
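The counting rule described above could be sketched roughly as follows. This is a simplified model, not the actual recording rules: the `RuleGroupEvaluation` type, its field names, and the sample group names are hypothetical and only serve to show how "duration exceeds interval" turns into an error count and a total count.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RuleGroupEvaluation:
    """One evaluation of a rule group (hypothetical shape for this sketch)."""
    group: str
    interval_seconds: float
    duration_seconds: float


def count_overruns(evaluations: Iterable[RuleGroupEvaluation]) -> Tuple[int, int]:
    """Return (errors, total): an evaluation is an error when it took
    longer than the interval its rule group is scheduled at."""
    errors = 0
    total = 0
    for evaluation in evaluations:
        total += 1
        if evaluation.duration_seconds > evaluation.interval_seconds:
            errors += 1
    return errors, total


if __name__ == "__main__":
    samples = [
        RuleGroupEvaluation("group-a", 60.0, 12.5),
        RuleGroupEvaluation("group-a", 60.0, 75.0),  # slower than its interval
        RuleGroupEvaluation("group-b", 300.0, 40.0),
        RuleGroupEvaluation("group-b", 300.0, 299.0),
    ]
    errors, total = count_overruns(samples)
    # Apdex-style ratio: share of evaluations that finished within the interval.
    print(errors, total, (total - errors) / total)
```

An evaluation that takes exactly as long as its interval is not counted as an error here, following the "exceeds" wording; whether the real rules treat that boundary the same way is an assumption.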
For gitlab-com/gl-infra/scalability#2204 (closed)
- Thanos dashboard snapshot: https://dashboards.gitlab.net/dashboard/snapshot/MrJt69dBiqq9TFmexs8atsrWGJgcNFgQ?var-environment=thanos
- Prometheus dashboard snapshot: https://dashboards.gitlab.net/dashboard/snapshot/ebZuAG3nL3B4FzvIP8cX94rSv9sdJrEq
Current state of the new SLI:
Edited by Bob Van Landuyt