fix: sensitive alerting for PatroniGCSSnapshotFailed
What
Change the PatroniGCSSnapshotFailed alert to trigger after 6h instead of 5m.
Why
We've been seeing multiple temporary failures of the GCS Snapshot script:
- gitlab-com/gl-infra/production#7996 (closed)
- gitlab-com/gl-infra/production#7849 (closed)
- gitlab-com/gl-infra/production#7742 (closed)
According to the cron job, the script runs every 6 hours:
steve@patroni-main-2004-10-db-gprd.c.gitlab-production.internal:~$ sudo -u gitlab-psql crontab -l | grep 'gcs-snapshot'
0 */6 * * * /usr/local/bin/gcs-snapshot.sh
Raising the alert delay to 6h means the on-call is only alerted if the snapshot has failed twice in a row.
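The timing argument can be checked with a rough sketch. It assumes a simplified model that is not taken from the alert rule itself: a failed run puts the alert into a failing state, the next successful run clears it, and the alert fires once the failing state has lasted for the configured delay. The function name `failed_runs_before_alert` is hypothetical, and a run that starts exactly when the delay expires is counted as one that must also fail.

```python
from datetime import timedelta


def failed_runs_before_alert(run_interval: timedelta, alert_delay: timedelta) -> int:
    """Consecutive failed runs needed before the alert fires.

    Assumed model (illustrative only): the first failure starts the clock,
    and every later scheduled run that falls within the delay window is a
    chance to recover, so all of those runs must fail too.
    """
    # Number of later runs scheduled at or before the moment the delay expires.
    later_runs = alert_delay // run_interval
    return 1 + later_runs


cron_interval = timedelta(hours=6)  # matches the `0 */6 * * *` schedule

# Old 5m delay: a single failed run is enough to alert.
print(failed_runs_before_alert(cron_interval, timedelta(minutes=5)))

# New 6h delay: the following run has to fail as well.
print(failed_runs_before_alert(cron_interval, timedelta(hours=6)))
```

Under this model the 5m delay alerts on a single transient failure, while the 6h delay gives the next scheduled run a chance to succeed first.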
Reference: gitlab-com/gl-infra/production#7996 (closed)