
Prometheus meta-monitoring for zonal cluster shards

Problem

Within a given Kubernetes cluster (or GCE project) we rely on the default Prometheus shard to monitor the other shards (e.g. app, db). At the time of writing, GKE clusters have only default shards.

If the default Prometheus or Alertmanager is down, we rely on Dead Man's Snitch to notify us. We have snitches configured for all default shards (GCE and GKE) except the zonal cluster Prometheus shards. As a result, we have no way of knowing when monitoring is broken in our zonal clusters.

Desired outcome

The most pragmatic solution for now is probably to add a snitch for each gprd zonal cluster, and to configure routing rules in Alertmanager that dispatch each zonal cluster's snitch alerts to the relevant (secret) snitch route. Alertmanager currently drops these alerts.
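
As a rough illustration of the routing described above, the following Python sketch generates one Alertmanager child route and one webhook receiver per zonal cluster. It is a sketch under assumptions, not our actual configuration: the cluster names, the `SnitchHeartBeat` alert name, the `cluster` label, and the `snitch-<cluster>` receiver naming are all hypothetical, and the snitch URLs are placeholders because the real ones are secret. Only the Alertmanager field names (`receiver`, `matchers`, `group_wait`, `repeat_interval`, `continue`, `webhook_configs`, `url`, `send_resolved`) are taken from the Alertmanager configuration format.

```python
import json

# Hypothetical zonal cluster names; the real gprd cluster names are not listed in this issue.
ZONAL_CLUSTERS = ["gprd-zone-b", "gprd-zone-c", "gprd-zone-d"]


def snitch_receiver_name(cluster):
    """Receiver naming convention assumed for this sketch."""
    return f"snitch-{cluster}"


def build_snitch_routes(clusters):
    """Return one child route per cluster.

    Each route matches the always-firing heartbeat alert (assumed to be named
    SnitchHeartBeat) coming from that cluster (assumed to carry a `cluster`
    label) and sends it to that cluster's snitch receiver.
    """
    return [
        {
            "receiver": snitch_receiver_name(cluster),
            "matchers": ['alertname="SnitchHeartBeat"', f'cluster="{cluster}"'],
            "group_wait": "0s",
            "repeat_interval": "1m",
            # Stop evaluating sibling routes once the heartbeat has matched.
            "continue": False,
        }
        for cluster in clusters
    ]


def build_snitch_receivers(clusters, snitch_urls):
    """Return one webhook receiver per cluster, pointing at its snitch URL."""
    return [
        {
            "name": snitch_receiver_name(cluster),
            "webhook_configs": [
                # A resolved heartbeat carries no information, so it is not sent.
                {"url": snitch_urls[cluster], "send_resolved": False}
            ],
        }
        for cluster in clusters
    ]


if __name__ == "__main__":
    # Placeholder URLs; the real snitch URLs would come from a secret store.
    urls = {c: f"https://snitch.invalid/{c}" for c in ZONAL_CLUSTERS}
    fragment = {
        "routes": build_snitch_routes(ZONAL_CLUSTERS),
        "receivers": build_snitch_receivers(ZONAL_CLUSTERS, urls),
    }
    print(json.dumps(fragment, indent=2))
```

The generated routes would need to sit ahead of whichever rule currently drops these alerts, since Alertmanager walks sibling routes in order.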

Any alternative would involve cross-cluster monitoring, which is fiddly.

Acceptance criteria

  • SREs are notified when Prometheus in the zonal clusters cannot send alerts.
Edited by Craig Furman