feat(alert): create issue when statefulset doesn't match for 12h

Steve Xuereb requested to merge feat/sts-alert into master

Background

In gitlab-com/gl-infra/production#5466 (closed) we saw Prometheus unable to start for a long time: it was trying to load a large WAL file and was repeatedly killed by Kubernetes because it was reaching the threshold.

We only realized this because someone happened to be looking at the cluster; no alert fired. This means we lost HA on our Prometheus setup because one of the pods was in a restart loop, and if the other pod had been rescheduled we would have had no visibility whilst trying to recover the WAL.

Solution

Create an issue in the production issue tracker when a statefulset hasn't had its expected number of replicas for 12h. The window is long because Prometheus restarts can take a long time.

This creates some visibility when a statefulset is broken, but doesn't page the on-call since it's just a cause, not a symptom.

We could create a specific rule just for the Prometheus statefulset, but it is useful to have it for all statefulsets since those pods usually need more hands-on attention.
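
For illustration, a rule of the kind described above might look roughly like the following. This is a minimal sketch, not the rule from this merge request: it assumes kube-state-metrics is being scraped (for the `kube_statefulset_replicas` and `kube_statefulset_status_replicas_ready` metrics), and the group name, alert name, and label values are hypothetical placeholders for whatever the routing that opens issues actually expects.

```yaml
groups:
  - name: statefulset-replicas            # hypothetical group name
    rules:
      - alert: StatefulSetReplicasMismatch  # hypothetical alert name
        # Fires for any statefulset whose ready replicas differ from the
        # desired count; both metrics come from kube-state-metrics.
        expr: |
          kube_statefulset_status_replicas_ready
            !=
          kube_statefulset_replicas
        # Long window, since Prometheus restarts can take a long time.
        for: 12h
        labels:
          # Assumed label values: low severity so the alert is routed to
          # issue creation rather than paging the on-call.
          severity: s4
          pager: "false"
        annotations:
          summary: >-
            StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }}
            has not matched its expected replica count for 12h
```

Comparing ready replicas against desired replicas (rather than checking pod restarts) keeps the rule generic across all statefulsets, which matches the intent of not special-casing Prometheus.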

Reference: gitlab-com/gl-infra/infrastructure#14359

Edited by Steve Xuereb