
fix: sensitive alerting for PatroniGCSSnapshotFailed

Steve Xuereb requested to merge fix/gcssnapshot-alerting into master

What

Change the PatroniGCSSnapshotFailed alert to trigger after 6h instead of 5m.

Why

We've been seeing multiple temporary failures of the GCS Snapshot script:

Looking at the cron job, the script runs every 6 hours:

steve@patroni-main-2004-10-db-gprd.c.gitlab-production.internal:~$ sudo -u gitlab-psql crontab -l  | grep 'gcs-snapshot'
0 */6 * * * /usr/local/bin/gcs-snapshot.sh

Increasing the alert delay to 6h means the on-call is only alerted if the snapshot has failed twice in a row.
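
To make the "twice in a row" reasoning concrete, here is a minimal Python sketch of the timing. It is a simplified model, not the actual alert rule: the function name `alert_fires` is made up for illustration, and it assumes the failure signal is set when a run fails and cleared as soon as the next scheduled run succeeds.

```python
def alert_fires(results, hold_min, interval_min=360):
    """Return True if the alert would fire under this simplified model.

    results      -- one boolean per cron run (True = snapshot succeeded)
    hold_min     -- minutes the failure must persist before alerting
    interval_min -- minutes between cron runs (360 for `0 */6 * * *`)

    Assumption: a failed run keeps the failure signal set until the next
    run, and a successful run clears it right at that boundary. The
    comparison is therefore strict.
    """
    streak = 0  # number of consecutive failed runs so far
    for ok in results:
        streak = 0 if ok else streak + 1
        # The signal has been (or will be) set for streak * interval_min
        # minutes before the next run gets a chance to clear it.
        if streak * interval_min > hold_min:
            return True
    return False


single_failure = [True, False, True]
double_failure = [True, False, False, True]

print(alert_fires(single_failure, hold_min=5))    # 5m delay: one failure pages
print(alert_fires(single_failure, hold_min=360))  # 6h delay: one failure is absorbed
print(alert_fires(double_failure, hold_min=360))  # 6h delay: two in a row pages
```

Under these assumptions, a 5m delay pages on any single failed run, while a 6h delay only pages once a second consecutive run has also failed.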

Reference: gitlab-com/gl-infra/production#7996 (closed)
