Observability investigation - August 2023 Thanos service Slack alerts on ThanosServiceRuleEvaluationErrorSLOViolation
Creating a tracking issue for the Slack alerts we receive on ThanosServiceRuleEvaluationErrorSLOViolation; a couple of PromQL query sketches for investigating them follow the sample alert below.
Sample Slack alert:
firing - Service thanos (thanos)
:fire: Alerts :fire:
ThanosServiceRuleEvaluationErrorSLOViolation :point_right: Thanos Graph
The rule_evaluation SLI of the thanos service (main stage) has an error rate violating SLO
This SLI monitors Prometheus recording rule evaluations. Recording rule evaluation failures are considered to be service failures. Warnings are also considered failures.
Rule groups evaluate their recording rules in sequence at a fixed interval. If evaluating all the rules in a group takes longer than the group's interval, we could be missing data points in the group.
If a group is frequently slow, we should split it up or improve query performance.
To see which rules are often not meeting their target, look at the SLI details. The rule_group label will contain information about the slow group.
Currently the error rate is 0.7365%.
:label: Labels :label:
alertname: ThanosServiceRuleEvaluationErrorSLOViolation
aggregation: component
alert_type: symptom
component: rule_evaluation
env: thanos
ruler_cluster: thanos
sli_type: error
stage: main
team: reliability_observability
tier: inf
type: thanos
window: 6h
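For context on the quoted error rate, an SLI like this one is typically computed as failed evaluations over total evaluations across the SLO window. A minimal PromQL sketch, assuming the standard Prometheus rule-manager metrics (which Thanos Ruler also exposes) and the alert's 6h window:

```
# Per-rule-group fraction of recording rule evaluations that failed over
# the 6h SLO window; a sustained value above the SLO threshold is what
# fires the alert. Per the alert text, the production SLI also counts
# evaluations with warnings as failures; that term is omitted here.
sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[6h]))
/
sum by (rule_group) (rate(prometheus_rule_evaluations_total[6h]))
```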
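For the missing-data-points case the alert describes, the rule-group timing metrics show which groups take longer to evaluate than their configured interval. Another sketch, under the same metric-name assumptions:

```
# Rule groups whose last evaluation ran longer than their configured
# interval; these risk skipped iterations and missing data points, and
# are candidates for being split up.
prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds

# Rule groups that have actually skipped evaluations over the SLO window.
increase(prometheus_rule_group_iterations_missed_total[6h]) > 0
```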