Build Streaming Database Restore
This issue covers two things: building a solid backup system, and continuously restoring those backups to be sure that they actually work, following investigations into the different available tools.
On February 1st we had an outage, caused by an error, which showed us that our backups are actually not working at all, simply because we never took the time to restore any.
Given that restoring a backup is a complex thing to do at our current scale, and that it needs to be resourced, we need to get some value out of performing this restoration.
With this constraint in mind, I propose using Wal-E to perform a streaming backup to S3 (or Azure object storage), and then using this same streamed backup to restore staging databases as they are needed. This effectively solves the need to build tooling for executing and rolling back migrations in staging, by giving us disposable staging environments.
This will solve both having a near-zero-loss backup and testing that it restores, by reusing the backup as yet another development tool.
How can we do it?
- Start using Wal-E to perform backups.
- Build a cookbook that we can apply to a host using chef-solo.
- Automate creating this host with terraform and applying the role, so we can do it much more often.
- Rinse and repeat.
I'm open to changing this plan, but the general idea is clear: we have a streaming backup, and we use this streaming backup to restore the database on a daily basis, effectively removing the pain of performing a restoration, as it will become a standard procedure.
Have you considered using AWS RDS solutions for failover and backups rather than building your own? If so, why did you choose to implement your own alternative?
We can't ship AWS RDS to our customers. Therefore we need to own our stack.
I have created an Azure blob for storage of production data, called `dbprodbackups`. Access Keys are in the 1Password DevOps vault under `Azure Blob DB Prod`.
The environment needed to get started is:

```
root@db1:/home/jjn# apt-get install -y python3 python3-setuptools \
    daemontools python3-pip lzop software-properties-common \
    apt-transport-https wget python3-virtualenv pv
```
To keep our version of `wal-e` separate and ensure that changes to the base machine don't screw us, we'll be using Python's virtualenv:
```
root@db1:/home/jjn# virtualenv -p python3 /opt/wal-e
```
Now we install `wal-e`:

```
root@db1:/home/jjn# /opt/wal-e/bin/pip3 install wal-e
```
Since we're using Azure BLOB, we also need to install the `azure` package:

```
root@db1:/opt/wal-e# /opt/wal-e/bin/pip3 install azure
```
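For reference, WAL-E reads its configuration from an envdir directory: one file per environment variable, with the value as the file's contents. A minimal sketch of what that setup might look like for Azure BLOB (the `WABS_*` variable names are from the WAL-E README; the storage account name, container path, and permissions here are assumptions):

```
# Each file under /etc/wal-e.d/env becomes an environment variable
# when the command is run through daemontools' envdir.
mkdir -p /etc/wal-e.d/env
echo '<storage-account-name>'      > /etc/wal-e.d/env/WABS_ACCOUNT_NAME
echo '<access-key-from-1Password>' > /etc/wal-e.d/env/WABS_ACCESS_KEY
echo 'wabs://dbprodbackups/db1'    > /etc/wal-e.d/env/WALE_WABS_PREFIX  # container/path: assumption
chown -R root:gitlab-psql /etc/wal-e.d   # assumption: readable by the postgres user
chmod -R 750 /etc/wal-e.d
```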
After setting up the environment variables with envdir under `/etc/wal-e.d/env`, I tried using Azure SAS Tokens; this did NOT work, so we'll have to use the wider Access Keys method. Testing that communication works to Azure BLOB storage:
```
root@db1:/opt/wal-e# /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-list
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2017-02-13T11:25:22.546007-00 pid=54424
name last_modified expanded_size_bytes wal_segment_backup_start wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
root@db1:/opt/wal-e#
```
This shows we're connecting and querying, now to put some backups there!
Added AWS credentials for IAM user `backup-db-prod`. Amazon S3 support requires `boto`, so this was installed with:
```
/opt/wal-e/bin/pip3 install boto
```
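The envdir layout for S3 is analogous; a sketch, where only the IAM user is from above and the bucket/path is an assumption:

```
# Credentials for the backup-db-prod IAM user; bucket and path are placeholders.
echo '<access-key-id>'     > /etc/wal-e.d/env/AWS_ACCESS_KEY_ID
echo '<secret-access-key>' > /etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
echo 's3://<bucket>/db1'   > /etc/wal-e.d/env/WALE_S3_PREFIX
```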
The first problem I encountered was that wal-e couldn't find the binaries to access the database, easily solved with a PATH prefix on the run. Now I'm encountering connection errors:
```
root@db1:/var/opt/gitlab/postgresql/data# sudo -u gitlab-psql PATH=/opt/gitlab/embedded/bin:/opt/gitlab/embedded/sbin:$PATH /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-push /var/opt/gitlab/postgresql
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-push".
        STRUCTURED: time=2017-02-14T08:48:30.106953-00 pid=40481
psql: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
wal_e.main   ERROR    MSG: could not run one or more external programs WAL-E depends upon
        DETAIL: Could not get a connection to the database: note that superuser access is required
        Could not run the following programs, are they installed? psql
        STRUCTURED: time=2017-02-14T08:48:30.137849-00 pid=40481
root@db1:/var/opt/gitlab/postgresql/data#
```
I now have WAL segments being shipped off to S3 in real time; now just working on the `backup-push`.
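For the record, the real-time shipping works by pointing PostgreSQL's `archive_command` at `wal-e wal-push`, so every completed WAL segment is uploaded as it is closed. A sketch of the relevant `postgresql.conf` settings (on an omnibus install these are managed through `gitlab.rb`; the timeout value is an assumption):

```
# postgresql.conf -- ship each finished WAL segment to object storage
archive_mode    = on
archive_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-push %p'
archive_timeout = 60   # assumption: force a segment switch at least once a minute
```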
We're off to the races with a `backup-push` underway, using the command:
```
root@db1:~# sudo -u gitlab-psql PGHOST=/var/opt/gitlab/postgresql/ PATH=/opt/gitlab/embedded/bin:/opt/gitlab/embedded/sbin:$PATH /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-push /var/opt/gitlab/postgresql/data
```
... and the proof that we've got something:
```
root@db1:/var/opt/gitlab/postgresql/data# /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-list
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2017-02-14T17:29:38.377443-00 pid=10625
name last_modified expanded_size_bytes wal_segment_backup_start wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
base_0000000200000B4800000078_13127472 2017-02-14T11:40:56.000Z  0000000200000B4800000078 13127472
root@db1:/var/opt/gitlab/postgresql/data#
```
Now to test a restore!
I'm currently running the wal-e `backup-push` job manually so that we have backups of the database. This is easier on the database than […]
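Until the cookbook and terraform work lands, a cron entry is the obvious stopgap for these manual runs; a sketch (the schedule is an assumption, the command is the one from above):

```
# /etc/cron.d/wal-e-backup-push -- nightly base backup as the postgres user
0 2 * * * gitlab-psql PGHOST=/var/opt/gitlab/postgresql/ PATH=/opt/gitlab/embedded/bin:/opt/gitlab/embedded/sbin:$PATH /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-push /var/opt/gitlab/postgresql/data
```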
We are now sending backups and WAL streams to S3 from `db1` :woohoo: <- we seriously need custom emoji here
Restoration tested and verified. Restoration was performed using the following process:
Stop the database:
```
root@db2:/home/jjn# gitlab-ctl stop postgresql
```
Create the `recovery.conf` file:
```
root@db2:/home/jjn# vi /var/opt/gitlab/postgresql/data/recovery.conf
```
This file should look like:
```
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'
recovery_target_time = '2017-02-01 02:12:00'
pause_at_recovery_target = 'false'
```
`recovery_target_time` is the exact time that you'd like to recover to, or you may use the `latest` time frame to pull the latest full backup and replay all of the available WAL segments, bringing you into full parity with the production server.
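For the full-parity case, a sketch of the corresponding `recovery.conf`, on the assumption that simply omitting the recovery target makes Postgres replay every WAL segment that `wal-fetch` can pull down:

```
# recovery.conf -- no recovery_target_time, so recovery runs to the end of
# the available WAL before the database is promoted.
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'
```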
In order to kick off this process, all you have to do now is start the database:
```
root@db2:/home/jjn# gitlab-ctl start postgresql
```
When the database recovery is done, the `recovery.conf` file will be renamed to `recovery.done`, and the Postgres database will be available for use.
Starting from nothing, it took just under two hours (1:48) to pull the data back and restore it to the point I specified.
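For completeness: "starting from nothing" means the base backup has to be pulled down with `backup-fetch` before the `recovery.conf` replay described above kicks in. A sketch of the whole sequence (`LATEST` is wal-e's pseudo-name for the most recent base backup; paths are as on db2):

```
# Pull the latest base backup into an empty data directory, then replay WAL.
gitlab-ctl stop postgresql
rm -rf /var/opt/gitlab/postgresql/data/*     # caution: destroys the local copy
sudo -u gitlab-psql /usr/bin/envdir /etc/wal-e.d/env \
    /opt/wal-e/bin/wal-e backup-fetch /var/opt/gitlab/postgresql/data LATEST
# create recovery.conf as above, then:
gitlab-ctl start postgresql
```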