Verify recording rules in Mimir
Context
Since we have been migrating a large number of recording rules from Thanos to Mimir, verifying by hand that every recording rule actually records something has become a big manual challenge.
A recording rule might produce no data in Mimir because of:
- A missing metric, e.g. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3401#note_1875277139
- A missing source label that is required by the SLI definition, e.g. https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3401#note_1875308140
- A missing hardcoded `monitor="global"` in one of the rules of a recording rule chain (we want to keep our dashboards backwards compatible with Thanos).
- etc.
Proposal
The idea here is to write a validation script that lists all the recording rules in `mimir-rules/` and queries both Thanos and Mimir for each of them. We could start by checking whether any data exists in Mimir. Some scenarios that could happen:
- Data exists in both Thanos and Mimir --> ✅ all good
- Data exists only in Thanos, not in Mimir --> 🟠 log the recording rule and check where the gap is
- Data exists only in Mimir, not in Thanos --> 🟠 log the recording rule and check where the gap is; this may be fine in some cases
- Data exists in neither Thanos nor Mimir --> ❓ this might be fine, as some recording rules don't have any data in Thanos either
This allows us to do a (quick?) sanity check for any missing piece in the recording rule chain. In the next iteration, we could also add a more detailed analysis of whether the values actually agree between the two, see #2876 (closed).
@abrandl suggested that we could start by using Jupyter Notebook in this project.
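The four scenarios listed in the proposal could be sketched in a notebook cell roughly as follows. This is a minimal sketch, not an existing script: the names `Verdict`, `has_data`, `classify` and `needs_follow_up` are hypothetical, and it assumes each backend is queried through the Prometheus-compatible instant query API, whose JSON response carries the series under `data.result`.

```python
from enum import Enum


class Verdict(Enum):
    """Outcome of checking one recording rule in both backends (hypothetical names)."""
    OK = "data in both Thanos and Mimir"
    ONLY_THANOS = "data only in Thanos"
    ONLY_MIMIR = "data only in Mimir"
    NEITHER = "no data in Thanos or Mimir"


def has_data(response: dict) -> bool:
    """Return True if a Prometheus-style instant query response contains any series."""
    return response.get("status") == "success" and len(response["data"]["result"]) > 0


def classify(in_thanos: bool, in_mimir: bool) -> Verdict:
    """Map the two existence checks onto the four scenarios."""
    if in_thanos and in_mimir:
        return Verdict.OK
    if in_thanos:
        return Verdict.ONLY_THANOS
    if in_mimir:
        return Verdict.ONLY_MIMIR
    return Verdict.NEITHER


def needs_follow_up(verdict: Verdict) -> bool:
    """Only the one-sided cases are logged for a gap analysis."""
    return verdict in (Verdict.ONLY_THANOS, Verdict.ONLY_MIMIR)


# Example with hand-written responses instead of real HTTP calls.
thanos_response = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"type": "web"}, "value": [1700000000, "1"]},
]}}
mimir_response = {"status": "success", "data": {"resultType": "vector", "result": []}}

verdict = classify(has_data(thanos_response), has_data(mimir_response))
print(verdict.name, needs_follow_up(verdict))
```

Keeping the classification separate from the HTTP calls means the same function can be reused whether the existence check comes from a live query or from a saved response.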
Some caveats:
- Some recording rule names don't exactly match between Thanos and Mimir. The most prominent one that I know of is `sli_aggregations:xxx_rate_5m` (Thanos) vs `sli_aggregations:xxx:rate_5m` (Mimir). This SLI aggregation recording rule is also not present for all SLIs in Thanos, but is present for all SLIs in Mimir. We don't need to check it, as all other recording rules for SLIs depend on it.
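One way the script could handle this caveat is to skip the SLI aggregation rules, and to translate their names only if a cross-backend comparison is wanted after all. The sketch below is an assumption built from the single example above: the helper names are hypothetical, and the regular expression assumes the only difference is a colon instead of an underscore before the trailing `rate_<window>` part.

```python
import re

# Assumed pattern: Mimir uses ":" before the final "rate_<window>" suffix,
# where Thanos uses "_". Derived from one example, so treat it as a guess.
_MIMIR_SLI_AGGREGATION = re.compile(r"^(sli_aggregations:.+):(rate_[0-9a-z]+)$")


def is_sli_aggregation(rule_name: str) -> bool:
    """True for rules that the existence check can skip."""
    return rule_name.startswith("sli_aggregations:")


def thanos_name_for(mimir_name: str) -> str:
    """Translate a Mimir rule name to its assumed Thanos spelling.

    Names that do not match the SLI aggregation pattern are returned unchanged.
    """
    match = _MIMIR_SLI_AGGREGATION.match(mimir_name)
    if match is None:
        return mimir_name
    return f"{match.group(1)}_{match.group(2)}"


print(thanos_name_for("sli_aggregations:xxx:rate_5m"))
```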
What to check
- Recording rule existence in Mimir (#3444 (comment 1881345969))
- Compare count() of recording rules between Mimir and Thanos: https://docs.google.com/spreadsheets/d/1Jy8L7RDQIE5uO7Xf8A2UgvwMEMv7mXsEjOeegjNAptQ/edit#gid=2025417442
- Compare count() of `gitlab_component_(ops|apdex)` for each SLI and type (done via manual query: https://docs.google.com/spreadsheets/d/1Jy8L7RDQIE5uO7Xf8A2UgvwMEMv7mXsEjOeegjNAptQ/edit#gid=1414211851)
- Compare count() of `gitlab_component_saturation:ratio` for each saturation point: https://docs.google.com/spreadsheets/d/1Jy8L7RDQIE5uO7Xf8A2UgvwMEMv7mXsEjOeegjNAptQ/edit#gid=1330753862
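The count() comparisons above could be automated along these lines. This is a sketch under assumptions, not the queries used for the spreadsheets: the function names are hypothetical, the `type` and `component` grouping labels are illustrative guesses, and the responses are expected in the Prometheus instant query format, where each sample has a `metric` label set and a `[timestamp, "value"]` pair.

```python
def count_query(selector: str, by: tuple = ()) -> str:
    """Build a PromQL count() query, optionally grouped by labels."""
    if by:
        return f"count by ({', '.join(by)}) ({selector})"
    return f"count({selector})"


def counts_from_response(response: dict) -> dict:
    """Map each label set in an instant query response to its count."""
    counts = {}
    for sample in response["data"]["result"]:
        key = tuple(sorted(sample["metric"].items()))
        counts[key] = float(sample["value"][1])
    return counts


def diff_counts(thanos: dict, mimir: dict) -> dict:
    """Return label sets whose counts differ, as (thanos, mimir) pairs.

    A label set missing on one side shows up with None for that side.
    """
    diff = {}
    for key in thanos.keys() | mimir.keys():
        pair = (thanos.get(key), mimir.get(key))
        if pair[0] != pair[1]:
            diff[key] = pair
    return diff


# Example with hand-written responses; the label names are assumptions.
query = count_query("gitlab_component_saturation:ratio", by=("type", "component"))
thanos = counts_from_response({"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"type": "web", "component": "cpu"}, "value": [1700000000, "3"]},
    {"metric": {"type": "api", "component": "cpu"}, "value": [1700000000, "2"]},
]}})
mimir = counts_from_response({"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"type": "web", "component": "cpu"}, "value": [1700000000, "3"]},
]}})
print(query)
print(diff_counts(thanos, mimir))
```

Keying the counts by the sorted label set keeps the comparison independent of the order in which each backend returns its series.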