Add apdex-SLI measuring recording-rule-group execution for Prometheus & Thanos
A recording rule group in Prometheus or Thanos Ruler runs on an interval, usually 1m or 2m in our case. If the total execution duration exceeds that interval, the rule group ends up running continuously, iterations are missed, and we record fewer data points than we would expect.
Prometheus & Thanos provide 2 metrics for this:
- `prometheus_rule_group_iterations_total`: the total number of iterations
- `prometheus_rule_group_iterations_missed_total`: the number of executions that exceeded their interval
We could build an SLI from this that would apply to both ServicePrometheus & ServiceThanos, preferably modelled as an apdex.
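As a rough sketch of how the apdex could be computed from these two counters: the ratio of iterations that were not missed to all scheduled iterations over a window. This assumes that `prometheus_rule_group_iterations_total` counts missed iterations as well as executed ones; the function name and the example numbers below are hypothetical and only illustrate the arithmetic, not how the SLI would be defined in the metrics catalog.

```python
from typing import Optional


def rule_group_apdex(iterations_delta: float, missed_delta: float) -> Optional[float]:
    """Apdex-style ratio for rule group execution over a time window.

    iterations_delta: increase of prometheus_rule_group_iterations_total
    missed_delta:     increase of prometheus_rule_group_iterations_missed_total

    Assumption: the iterations counter includes missed iterations, so the
    "satisfied" count is the difference between the two increases.
    """
    if iterations_delta <= 0:
        # No iterations scheduled in the window: the ratio is undefined.
        return None
    # Clamp at zero in case counter resets make the difference negative.
    satisfied = max(iterations_delta - missed_delta, 0.0)
    return satisfied / iterations_delta


if __name__ == "__main__":
    # Hypothetical window: 60 scheduled iterations, 3 of them missed.
    print(rule_group_apdex(60, 3))
```

Modelling it this way keeps the SLI a simple success ratio, so the same calculation could apply to both Prometheus and Thanos as long as both expose the two counters.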
Edited by Bob Van Landuyt