Open production issues for SLO alerts for services owned by the Observability SRE Team
With the rollout of Thanos Frontend, we'll have a better cache and proxy in front of the query server, which should make what I'm proposing in this issue an even better idea. But why wait?
Our thanos_query SLO violations generally occur due to specific poorly performing queries, either user-initiated or coming from Grafana dashboards. Each dip is an opportunity to optimize a dashboard, or to reach out to a team member and offer our assistance. The best place to do this is in an individual issue.
Definition of Done
- each thanos_query SLO violation opens an Incident in the production tracker via GitLab's Alertmanager integration
- each thanos_query SLO violation alert routes directly to the #sre_observability Slack channel (this doesn't seem to work; let's discuss it in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12131)
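As a starting point for discussion, the two items above could be sketched as an Alertmanager routing fragment along these lines. This is not our actual configuration: the label names (`type`, `slo_alert`), the receiver names, the endpoint URLs, and the authorization key are all placeholders I made up for illustration. Only the `#sre_observability` channel comes from this issue.

```yaml
# Sketch only: label names, receiver names, URLs, and credentials are assumptions.
route:
  receiver: default                      # hypothetical fallback receiver
  routes:
    - matchers:
        - type = "thanos_query"          # assumed label identifying the service
        - slo_alert = "yes"              # assumed label marking SLO violations
      receiver: gitlab_production_incidents
      continue: true                     # keep evaluating so Slack is notified too
    - matchers:
        - type = "thanos_query"
        - slo_alert = "yes"
      receiver: slack_sre_observability

receivers:
  - name: default
  - name: gitlab_production_incidents
    webhook_configs:
      - url: "https://gitlab.example.com/<alert-integration-endpoint>"   # placeholder
        send_resolved: true
        http_config:
          authorization:
            credentials: "<integration-authorization-key>"               # placeholder
  - name: slack_sre_observability
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"        # placeholder
        channel: "#sre_observability"
        send_resolved: true
```

The `continue: true` on the first route is what lets a single alert reach both receivers; without it, Alertmanager stops at the first matching sibling route.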
Nice To Have
- each of the Incident issues is assigned to @sre-observability
Edited by Craig Furman