
Create a testing framework for recording and alerting rules

In https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/6378, we accidentally broke recording rules for alerts: the recording rules evaluated successfully, but due to a label mismatch, the alert would never fire.

This condition only occurred when alerting using confidence levels.

To fix the situation for GitLab Dedicated, and to make sure it doesn't happen again when GitLab.com starts using confidence levels for alerting on lower-traffic SLIs, we need to change all of the recording rules. That is pretty scary to do: right now, the only way to validate those rules is to run the query in Grafana over a timeframe that should have alerted and check whether the metrics show up.

It would be nice to build some automated testing around this. Some ideas so far:

  • As an integration test: spin up a Prometheus server inside a CI job, preload it with some metrics, and run the queries from the rule under test against it to see whether the results are valid. We could potentially use metrics from an actual GitLab installation. A minimal sketch of such a job follows this list.
  • As unit tests: using promtool test (https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/), we can specify metrics in a test-config file targeting a specific rule file. Based on those inputs, we can validate that the rules produce the expected outputs; see the promtool sketch further below.
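
For the integration-test idea, here is a minimal sketch of what such a CI job could look like. Everything in it is an assumption for illustration: the image tag, the `testdata/samples.om` fixture, the metric names, and the query timestamp; none of it reflects the actual runbooks setup.

```yaml
# Hypothetical .gitlab-ci.yml job; file names, metric names, and the
# timestamp are illustrative assumptions.
validate-rule-queries:
  image:
    name: prom/prometheus:v2.53.0
    entrypoint: [""]
  script:
    # Backfill OpenMetrics-formatted samples (e.g. exported from a real
    # GitLab installation) into a fresh TSDB.
    - promtool tsdb create-blocks-from openmetrics testdata/samples.om ./data
    # Start Prometheus against the backfilled storage.
    - prometheus --storage.tsdb.path=./data --config.file=/etc/prometheus/prometheus.yml &
    - sleep 10
    # Evaluate the rule's expression at a timestamp covered by the backfill;
    # fail the job if it returns no series.
    - |
      result=$(promtool query instant --time='2024-06-01T00:30:00Z' \
        http://localhost:9090 \
        'rate(gitlab_sli_errors_total[5m]) / rate(gitlab_sli_total[5m])')
      echo "$result"
      test -n "$result" || { echo "rule expression returned no series"; exit 1; }
```

Because the backfilled TSDB only contains historical samples, the job evaluates the rule's expression at a past timestamp via `promtool query instant` rather than waiting for live rule evaluation.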

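For the unit-test idea, here is a sketch of what a promtool test could look like, against a hypothetical rule file (`aggregation_rules.yml`) with made-up metric names, labels, and thresholds. The alert assertion is the interesting part: it fails if the alert's label set doesn't line up with what the recording rule actually produces, which is exactly the failure mode from the incident above.

```yaml
# aggregation_rules.yml -- hypothetical rules under test; names are made up.
groups:
  - name: sli_aggregations
    rules:
      # Recording rule: error ratio per component.
      - record: sli_aggregations:error_ratio_rate_5m
        expr: rate(gitlab_sli_errors_total[5m]) / rate(gitlab_sli_total[5m])
      # Alert on the recorded series; a label mismatch here is the kind
      # of bug these tests should catch.
      - alert: SLIErrorRateTooHigh
        expr: sli_aggregations:error_ratio_rate_5m > 0.05
        labels:
          severity: s2
        annotations:
          title: 'Error ratio for {{ $labels.component }} is too high'
```

```yaml
# sli_tests.yml -- promtool unit tests for the rules above.
rule_files:
  - aggregation_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic counters: 10 errors and 100 requests per minute for an hour.
      - series: 'gitlab_sli_errors_total{env="gprd", component="api"}'
        values: '0+10x60'
      - series: 'gitlab_sli_total{env="gprd", component="api"}'
        values: '0+100x60'

    # The recording rule should produce a 10% error ratio with the
    # input labels preserved.
    promql_expr_test:
      - expr: sli_aggregations:error_ratio_rate_5m
        eval_time: 30m
        exp_samples:
          - labels: 'sli_aggregations:error_ratio_rate_5m{component="api", env="gprd"}'
            value: 0.1

    # The alert should fire with the expected label set; this assertion
    # fails on exactly the label mismatch that broke alerting.
    alert_rule_test:
      - alertname: SLIErrorRateTooHigh
        eval_time: 30m
        exp_alerts:
          - exp_labels:
              env: gprd
              component: api
              severity: s2
            exp_annotations:
              title: 'Error ratio for api is too high'
```

Running `promtool test rules sli_tests.yml` evaluates the recording rules over the synthetic series and asserts both the recorded value and the firing alert, including its full label set.
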
Discussed in gitlab-com/runbooks!7939 (comment 2137456257)