Define alerting rules in Thanos for all clusters
Currently there are no alerting rules in the management cluster for workload cluster problems.
Thanos aggregates metrics from all clusters (management + workload) and is connected to the management cluster's Alertmanager.
We need to define alerting rules that apply to all clusters for different components:
- kubernetes
- node-exporter
- monitoring stack (prometheus, alertmanager, etc.)
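
As a starting point for discussion, here is a rough sketch of what one rule per component could look like in a Prometheus-format rule file loaded by Thanos Ruler. Everything specific here is an assumption and not taken from the current setup: the group name, the thresholds and `for` durations, the `job="prometheus"` selector, and in particular the `cluster` label, which is assumed to be set as an external label on each cluster's Prometheus so that alerts can tell clusters apart.

```yaml
groups:
  - name: all-clusters-example          # hypothetical group name
    rules:
      # kubernetes: node reported as not Ready (metric from kube-state-metrics)
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 15m                        # assumed duration
        labels:
          severity: warning
        annotations:
          # "cluster" is an assumed external label identifying the source cluster
          summary: "Node {{ $labels.node }} in cluster {{ $labels.cluster }} is not ready"

      # node-exporter: less than 10% free space on a filesystem (assumed threshold)
      - alert: NodeFilesystemAlmostOutOfSpace
        expr: |
          node_filesystem_avail_bytes{fstype!=""}
            / node_filesystem_size_bytes{fstype!=""} < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} ({{ $labels.cluster }}) is almost full"

      # monitoring stack: a Prometheus scrape target is down (assumed job name)
      - alert: PrometheusTargetDown
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus in cluster {{ $labels.cluster }} is down"
```

Because the expressions do not filter on a specific cluster, each rule is evaluated against the metrics of every cluster that Thanos aggregates, and the resulting alerts carry the originating cluster's labels through to Alertmanager.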
This is a good candidate for backporting to 1.3 to allow proper monitoring and alerting there.
Edited by Alin H