Site Reliability Engineering: Automated Restore Support
Summary
One of the core concepts of good Site Reliability Engineering is not just taking backups but restoring them, and restoring them often. I see that the GitLab Helm Chart has a cronjob object which spins up a short-lived gitlab-toolbox pod to generate a backup file using GitLab's backup-utility binary.
This is all great and fine! Let's take it a step further!
Feature Request
What I'd like to request is a cronjob object which spins up a short-lived gitlab-toolbox pod that takes whatever the latest backup file is and uses the backup-utility to attempt to restore it. The use case is of course a non-production environment, where regular restores confirm that our backups are valid and usable and help verify configuration changes for our use cases. We could then set alerts on the age of the most recent successful restore in the non-production environment and fire when that age exceeds a threshold.
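To make the two moving parts concrete, here is a rough sketch of the logic I have in mind: picking the latest backup and deciding whether the last restore is too old. It is not chart code. The function names are hypothetical, and it assumes backup archives are named with a leading epoch timestamp and end in _gitlab_backup.tar.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumption: archives are named <epoch>_<date>_<version>_gitlab_backup.tar
SUFFIX = "_gitlab_backup.tar"


def latest_backup(names: Iterable[str]) -> Optional[str]:
    """Return the archive name with the highest leading epoch timestamp.

    Names that do not match the assumed pattern are ignored.
    Returns None when no candidate is found.
    """
    candidates = []
    for name in names:
        prefix = name.split("_", 1)[0]
        if name.endswith(SUFFIX) and prefix.isdigit():
            candidates.append((int(prefix), name))
    return max(candidates)[1] if candidates else None


def restore_is_stale(last_success: datetime, now: datetime,
                     max_age: timedelta) -> bool:
    """True when the last successful restore is older than max_age."""
    return now - last_success > max_age
```

The restore job would pass the chosen archive to backup-utility, and the alert would call something like restore_is_stale with whatever threshold the operator configures. Where the list of names and the last-success timestamp come from (object storage listing, job status, a metric) is left open here.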