Alert when a client cluster Prometheus goes down

Summary

Because each cluster uses remote_write to send data to Thanos, when a Prometheus server stops sending data we cannot tell whether the Prometheus instance (or its cluster) went down or the cluster was deleted.

We need to tell Thanos which clusters it should expect to receive data from.

Details

The Cluster API resource clusters.cluster.x-k8s.io lists all known clusters.

Create a helper for Thanos Ruler that watches the Kubernetes API for cluster resources and writes the clusters to a ConfigMap as recording rules.
Mount the ConfigMap as a rules file in Thanos Ruler.
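
As a rough sketch of what the helper could write, the ConfigMap below wraps one recording rule in a rule group, since a Prometheus-style rules file expects a top-level `groups` list. The ConfigMap name, its namespace, the data key, and the group name are all hypothetical; only the rule itself comes from the expected output in this issue.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-known-clusters   # hypothetical name
  namespace: monitoring               # hypothetical namespace
data:
  # Hypothetical file name; this is the file Thanos Ruler would load.
  known-clusters.rules.yaml: |
    groups:
    - name: k8s-clusters-list         # hypothetical group name
      rules:
      # One rule per Cluster resource found by the helper.
      - record: exported:script:k8s:clusters
        expr: vector(1)
        labels:
          cluster: management
          k8s_api_group: cluster.x-k8s.io
          k8s_api_version: v1beta1
          namespace: sylva-system
          script: k8s-clusters-list
```

Once the ConfigMap is mounted as a volume in the Thanos Ruler pod, the file can be picked up by pointing the `--rule-file` flag of `thanos rule` at the mount path. How the mount is declared depends on how Thanos Ruler is deployed here, so that part is left out.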

We can then use the resulting metrics to alert on known clusters that are not sending data.
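
A possible alerting rule built on these metrics is sketched below. It assumes that every client Prometheus attaches a `cluster` external label whose value matches the `cluster` label in the recording rules, and that the `up` metric is among the series sent through remote_write. The group name, alert name, `for` duration, and severity are hypothetical.

```yaml
groups:
- name: known-clusters-alerts               # hypothetical group name
  rules:
  - alert: ClusterPrometheusNotSendingData  # hypothetical alert name
    # Keep every known cluster for which no `up` series with the same
    # `cluster` label currently exists in Thanos.
    expr: |
      exported:script:k8s:clusters
        unless on (cluster)
      count by (cluster) (up)
    for: 10m                                # hypothetical threshold
    labels:
      severity: critical                    # hypothetical severity
    annotations:
      summary: "No data received from cluster {{ $labels.cluster }} (namespace {{ $labels.namespace }})"
```

Because the left-hand side only contains clusters that still exist in the Kubernetes API, a deleted cluster drops out of the recording rules and stops matching, while a cluster that still exists but has gone silent keeps firing.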

Expected output

rules:
- expr: vector(1)
  labels:
    cluster: management ### metadata.cluster.name
    k8s_api_group: cluster.x-k8s.io
    k8s_api_version: v1beta1
    namespace: sylva-system ### metadata.cluster.namespace
    script: k8s-clusters-list
  record: exported:script:k8s:clusters
- expr: vector(1)
  labels:
    cluster: cluster-test ### metadata.cluster.name
    k8s_api_group: cluster.x-k8s.io
    k8s_api_version: v1beta1
    namespace: workload-cluster-test ### metadata.cluster.namespace
    script: k8s-clusters-list
  record: exported:script:k8s:clusters

Edited Sep 25, 2024 by Alin H