Automated testing of recovering PostgreSQL database backups
We need a procedure to recover a PostgreSQL backup. This could be something along the lines of the following:
Once a week a host is started automatically, or otherwise woken up somehow. Once it starts it will try to grab the most recent database backup and use this to start a local PostgreSQL database. Once this process has completed it will connect to the DB and check a few tables to see if there's data (e.g. users, projects, and issues). If these tables don't have any data, or the restoring process failed, errors should be sent to Slack.
This restore procedure should use the same underlying tools as a production engineer might use to manually restore a backup. This means we can automatically (and periodically) test these tools, removing the need for engineers to do this manually.
For information: you can actually use barman to restore directly a postgres data dir from your available (barman) backups.
So a potentially good flow would be to automatically and regularly run:
barman backup main-database
barman restore main-databse← at this point you have a restored data dir in for e.g.
- Start a postgres instance pointing to the
- Run a set of SQL queries on the restored base
In addition to checking if there is any data, you should probably check for having both new/fresh and old data, to catch various things that can go wrong (restored an old backup, restored an incremental backup without the base).
From an Email I got:
read the story and really sorry about it. The following are my thoughts only and not representing my company. Have some idea about automated backup #1102 Give a try to pipeline and break it up to stages: add marker to db backup to S3 size check request new machine restore backup check marker additional check what you can think of drop machine - if any of them fails, backup pipeline fails and ops need to take a look.
@tmaczukin I think this is what you saw: https://code.facebook.com/posts/1007323976059780/continuous-mysql-backup-validation-restoring-backups/
We should be looking at:
- last changed time
- file size comparable to what we expect
- object count increase in S3
I have added a new database node called dbbackup which will be used as backup node of the database and is currently setup a replication node.
Next step will be configure it so pg_dump is able to succeed.