Make it possible to remove corrupted data from Thanos for a given time range

In production#7751 (closed), recording rules were failing and a lot of corrupted data was stored in Thanos.

This data affects the Availability number we publish every month. For example:

[image: Grafana]

The dip on the 17th (which covers the 24 hours before the 17th at 00:00) is mostly due to the incident in production#7751 (closed).

This also affects error budgets for stage groups, for example for group::source code: https://dashboards.gitlab.net/d/stage-groups-detail-source_code/stage-groups-source-code-group-error-budget-detail?orgId=1&from=1663283691823&to=1663296608093

[image]

We should remove the Thanos data for the affected time range (2022-09-16 00:27 to 2022-09-16 01:36). I think no data would be better than misleading data for error budgets and GitLab.com availability.
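To make the window concrete, here is a minimal sketch that converts it to epoch milliseconds and checks whether a stored block overlaps it. It assumes the timestamps above are UTC (the issue does not state a timezone), and that a block covers the half-open range [min, max) in milliseconds. The helper names `to_epoch_ms` and `block_overlaps` are hypothetical and not part of any Thanos tooling.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: the window in this issue is expressed in UTC.
WINDOW_START = datetime(2022, 9, 16, 0, 27, tzinfo=timezone.utc)
WINDOW_END = datetime(2022, 9, 16, 1, 36, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def block_overlaps(block_min_ms: int, block_max_ms: int,
                   start_ms: int, end_ms: int) -> bool:
    """Return True if a block [block_min_ms, block_max_ms) intersects
    the inclusive window [start_ms, end_ms].

    Assumption: block bounds are half-open; adjust if your storage differs.
    """
    return block_min_ms <= end_ms and start_ms < block_max_ms


if __name__ == "__main__":
    start_ms = to_epoch_ms(WINDOW_START)
    end_ms = to_epoch_ms(WINDOW_END)
    print(start_ms, end_ms)
```

The `from` and `to` parameters in the dashboard link above use the same unit (epoch milliseconds), so the computed bounds can be compared against them directly.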

I think we should remove data for the following aggregation sets:

  • componentSLIs
  • regionalComponentSLIs
  • nodeComponentSLIs
  • serviceSLIs
  • nodeServiceSLIs
  • regionalServiceSLIs
  • featureCategorySLIs
  • serviceComponentStageGroupSLIs
  • stageGroupSLIs

We should also remove the recordings for sla:gitlab:ratio. This would ensure that this period does not affect error budgets for stage groups or our availability.
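As a rough illustration of how the recordings to remove could be addressed, the sketch below builds a Prometheus-style series selector on `__name__` from a list of recording names. It assumes `sla:gitlab:ratio` is the exact series name. The recording names behind each aggregation set are not listed in this issue, so they would have to be supplied by the caller. `name_selector` is a hypothetical helper, not an existing API.

```python
import re
from typing import Iterable

# Metric names in Prometheus match this pattern, so validated names
# need no regex or string escaping inside the selector.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def name_selector(names: Iterable[str]) -> str:
    """Build a selector matching exactly the given recording names."""
    names = list(names)
    if not names:
        raise ValueError("at least one recording name is required")
    for name in names:
        if not METRIC_NAME.fullmatch(name):
            raise ValueError(f"not a valid metric name: {name!r}")
    if len(names) == 1:
        return '{__name__="%s"}' % names[0]
    # Regex matchers are fully anchored, so alternation matches whole names only.
    return '{__name__=~"%s"}' % "|".join(names)


if __name__ == "__main__":
    print(name_selector(["sla:gitlab:ratio"]))
```

Keeping the names as explicit input, rather than matching on a broad prefix, limits the risk of removing series that were not affected by the incident.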

Ideally, we'd have this as a repeatable process so we can apply the same approach to maintenance windows in the future.

Edited by Bob Van Landuyt