Prometheus monitoring for backups
Our current alerting for backups (the ones that go to S3, I believe) relies on sending emails. This isn't working because DMARC is not set up. Furthermore, these emails may go to an inbox that nobody actively checks.
Backup alerts should instead use Prometheus alerts and be sent to Slack. These alerts should be triggered by some process that periodically checks (e.g. every hour) if we have all the right backups in place. Alerts should be sent when:
- The most recent backup is smaller than 300 GB (our current DB size)
- The last backup is more than N hours old (this depends a bit on how frequently we're running backups)
- The backup process failed for whatever reason. The alert should include the full output, or alternatively the output should be stored somewhere and the alert should include a link to it
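The three conditions above could be evaluated by the periodic checker roughly as follows. This is only a sketch: `BackupRecord`, `check_backups`, and their fields are hypothetical names, how backup metadata is actually collected (S3 listing, job logs, etc.) is left out, 300 GB is interpreted as decimal gigabytes, and `max_age_hours` stands in for the still-undecided N.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# 300 GB threshold from this issue; treating GB as 10**9 bytes is an assumption.
MIN_SIZE_BYTES = 300 * 10**9


@dataclass
class BackupRecord:
    """Hypothetical description of a single backup run."""
    finished_at: datetime   # when the run ended (timezone-aware)
    size_bytes: int         # size of the resulting backup
    succeeded: bool         # whether the backup process exited cleanly
    output: str = ""        # captured output of the backup process


def check_backups(records: List[BackupRecord], now: datetime,
                  max_age_hours: float,
                  min_size_bytes: int = MIN_SIZE_BYTES) -> List[str]:
    """Return alert reasons for the most recent backup; empty list means OK."""
    if not records:
        return ["no backups found"]

    latest = max(records, key=lambda r: r.finished_at)
    alerts = []

    if not latest.succeeded:
        # Include the full output so the alert is actionable on its own.
        alerts.append("backup process failed: " + latest.output)
    elif latest.size_bytes < min_size_bytes:
        # Size is only checked for successful runs to avoid a duplicate alert.
        alerts.append(
            f"latest backup is too small: {latest.size_bytes} < {min_size_bytes} bytes"
        )

    if now - latest.finished_at > timedelta(hours=max_age_hours):
        alerts.append(f"latest backup is older than {max_age_hours} hours")

    return alerts
```

The returned reasons would still need to be turned into Prometheus alerts and routed to Slack; that wiring is not shown here.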
Backup monitoring should also produce graphs in Grafana, showing data such as the number of backups, size of the last backup, etc.
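For the Grafana graphs, the checker would need to expose a few gauges to Prometheus. A minimal sketch that renders them in the Prometheus text exposition format is below; the metric names and the `render_metrics` function are hypothetical, and whether the result is written to a file for node_exporter's textfile collector or served some other way is an open choice.

```python
def render_metrics(backup_count, last_size_bytes, last_finished_ts):
    """Render backup gauges in the Prometheus text exposition format.

    All metric names below are illustrative, not an agreed naming scheme.
    """
    metrics = [
        ("backup_count",
         "Number of backups currently stored.",
         backup_count),
        ("backup_last_size_bytes",
         "Size of the most recent backup in bytes.",
         last_size_bytes),
        ("backup_last_finished_timestamp_seconds",
         "Unix time at which the most recent backup finished.",
         last_finished_ts),
    ]

    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

With gauges like these, the size and age conditions could also be expressed as Prometheus alerting rules on the metrics rather than inside the checker itself.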
The monitoring should also cover LVM snapshots and Azure disk snapshots.
Related to: gitlab-cookbooks/gitlab-prometheus#19 (closed)
Edited by Pablo Carranza [GitLab]