Alert when a client cluster Prometheus goes down
Summary
Each cluster sends its data to Thanos via remote_write. If a Prometheus server stops sending data, we have no way of knowing whether the Prometheus instance or its cluster went down, or whether the cluster was deleted.
We need to tell Thanos which clusters it should expect to receive data from.
Details
The clusters.cluster.x-k8s.io resource lists all known clusters.
- Create a helper for Thanos Ruler that watches the Kubernetes API for cluster resources and writes the clusters to a ConfigMap as recording rules
- Mount the ConfigMap as a rules file in Thanos Ruler
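A minimal sketch of the conversion step the helper would perform, assuming it receives Cluster objects as plain dicts in the shape the Kubernetes API returns them (`apiVersion`, `metadata.name`, `metadata.namespace`). The function names are hypothetical, and watching the API, serializing to YAML and updating the ConfigMap are deliberately left out:

```python
from __future__ import annotations

# Values taken from the expected output in this issue.
RECORD_NAME = "exported:script:k8s:clusters"
SCRIPT_NAME = "k8s-clusters-list"


def cluster_to_rule(cluster: dict) -> dict:
    """Build one recording rule for a single Cluster object (hypothetical helper)."""
    # apiVersion has the form "<group>/<version>", e.g. "cluster.x-k8s.io/v1beta1".
    api_group, _, api_version = cluster["apiVersion"].partition("/")
    metadata = cluster["metadata"]
    return {
        "expr": "vector(1)",
        "labels": {
            "cluster": metadata["name"],
            "k8s_api_group": api_group,
            "k8s_api_version": api_version,
            "namespace": metadata["namespace"],
            "script": SCRIPT_NAME,
        },
        "record": RECORD_NAME,
    }


def clusters_to_rules(clusters: list[dict]) -> dict:
    """Build the `rules:` document for a list of Cluster objects."""
    # Sort by namespace and name so the generated content is stable between
    # runs and the ConfigMap only changes when the set of clusters changes.
    ordered = sorted(
        clusters,
        key=lambda c: (c["metadata"]["namespace"], c["metadata"]["name"]),
    )
    return {"rules": [cluster_to_rule(c) for c in ordered]}


if __name__ == "__main__":
    example = [
        {
            "apiVersion": "cluster.x-k8s.io/v1beta1",
            "metadata": {"name": "management", "namespace": "sylva-system"},
        },
    ]
    print(clusters_to_rules(example))
```

The resulting dict can be dumped with any YAML library before it is written to the ConfigMap.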
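We can then use these metrics to alert on known clusters that are not sending data: a cluster that is listed in the recording rules but has no incoming series is either down or unreachable.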
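One possible shape for such an alert, sketched as a Python dict so it could be generated next to the recording rules. This is an assumption, not a decided design: it presumes that the remote-written series include `up` and carry a `cluster` label matching the one on the recording rules. The alert name, the `for` duration and the severity are hypothetical placeholders.

```python
def missing_cluster_alert(
    record: str = "exported:script:k8s:clusters",
    window: str = "10m",
) -> dict:
    """Build an alerting rule for expected clusters with no incoming data."""
    # `unless on (cluster)` keeps each expected-cluster series only when no
    # `up` series with the same `cluster` label is present in Thanos.
    expr = f"{record} unless on (cluster) count by (cluster) (up)"
    return {
        "alert": "ClusterNotSendingData",  # hypothetical name
        "expr": expr,
        "for": window,  # assumed grace period before firing
        "labels": {"severity": "critical"},  # assumed severity
        "annotations": {
            "summary": "No data received from cluster {{ $labels.cluster }}",
        },
    }


if __name__ == "__main__":
    print(missing_cluster_alert()["expr"])
```

Because the left-hand side comes from the list of known clusters, a deleted cluster drops out of the recording rules and stops alerting, while a cluster that still exists but has gone silent keeps matching.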
Expected output
rules:
  - expr: vector(1)
    labels:
      cluster: management ### metadata.cluster.name
      k8s_api_group: cluster.x-k8s.io
      k8s_api_version: v1beta1
      namespace: sylva-system ### metadata.cluster.namespace
      script: k8s-clusters-list
    record: exported:script:k8s:clusters
  - expr: vector(1)
    labels:
      cluster: cluster-test ### metadata.cluster.name
      k8s_api_group: cluster.x-k8s.io
      k8s_api_version: v1beta1
      namespace: workload-cluster-test ### metadata.cluster.namespace
      script: k8s-clusters-list
    record: exported:script:k8s:clusters
Edited by Alin H