Define alerting rules in Thanos for all clusters

Currently there are no alerting rules in the management cluster for workload cluster problems.

Thanos aggregates metrics from all clusters (management + workload) and is connected to the management cluster's Alertmanager.

We need to define alerting rules that apply to all clusters for the following components:

  • kubernetes
  • node-exporter
  • monitoring stack (prometheus, alertmanager, etc.)
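
As a starting point for discussion, here is a rough sketch of what such a rule group could look like in the standard Prometheus rule file format. The group name, alert names, `job` label values, thresholds and the `cluster` label are all assumptions for illustration and need to be checked against the labels our clusters actually expose in Thanos.

```yaml
# Illustrative sketch only: names, job labels, durations and the `cluster`
# label are assumptions, not the rules we have agreed on.
groups:
  - name: all-clusters-example
    rules:
      # kubernetes: a node has not been Ready for a while
      # (kube_node_status_condition is exposed by kube-state-metrics)
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} in cluster {{ $labels.cluster }} is not Ready"

      # node-exporter: scrape target is down
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "node-exporter on {{ $labels.instance }} (cluster {{ $labels.cluster }}) is down"

      # monitoring stack: a Prometheus scrape target is down
      - alert: PrometheusTargetDown
        expr: up{job="prometheus"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus in cluster {{ $labels.cluster }} is down"
```

Because the rules are evaluated against the aggregated view, each alert carries whatever per-cluster label the series already have, so the same rule covers management and workload clusters without duplication.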

This is a good candidate for backporting to 1.3 to allow proper monitoring and alerting.

cc @tmmorin @matrohon @marc.bailly1

Edited Sep 15, 2025 by Alin H