Create PostgreSQL Snapshots before backing up database(s)
What does this MR do and why?
Gitlab is splitting up the database. Instead of having 1 big monolith database, we want to run Gitlab on two distinct PostgreSQL databases. As a result of having two databases, it becomes more difficult to achieve consistent database backups from our rake tasks. This MR will create database snapshots before the database backup is created.
It works like this:
- For each configured database, we run
pg_export_snapshot()
function and remember the snapshot_id created by PostgreSQL - For each database, pg_dump is called with
--snapshot
argument - After running pg_dump, we remove the snapshot by rolling back the transaction
This will not guarantee consistent backups but we are almost consistent: data modified during the period between the two pg_export_snapshot() calls could lead to inconsistencies. In many installations, the number of updates and inserts is low enough for this to be not an issue. The snapshots are taken one after the other and it is fast: a simple timing shows that we need 0.01 seconds for taking both snapshots.
Related to #366727 (closed)
Alternatives
- pg_dumpall can't be used because it does not guarantee consistency across databases
- pg_basebackup results in consistent (cluster-level) backups but has no option to filter out certain databases
Notable changes
Backup::Manager
We used to have distinct backup tasks for main
and ci
database. These are now merged into a resurrected db
task. The reason for this change: the Backup::Manager
should not be responsible for setting the PostgreSQL snapshots.
Backup::Database
- The
dump
method was not under test. So this MR will add tests for this method - The
dump
method will now first set the PostgreSQL snapshots and then, for each database, run a pg_dump - The
restore
method had to be changed since themain
andci
tasks has been removed. The calls togitlab:db:drop_tables
rake task have been replaced bygitlab:db:drop_tables:main
andgitlab:db:drop_tables:ci
, leading to a more complex rspec setup, because these rake tasks are dynamically generated. - The code complexity of
dump
method was too high. So the actual dumping and gzipping logic has been moved to a separate class:Backup::Dump::Postgres
How to set up and validate locally
Test the snapshotting of the database
- Modify
Backup::Database.dump
method: add a sleep(10) after the snapshots have been set - Run
SKIP=tar bundle exec rake gitlab:backup:create
- During the sleep(10), insert some data in a random table:
gdk psql
INSERT INTO shards (name) VALUES('4fded1464736e77865df232cbcb4cd19');
- When the backup is done:
gunzip -c tmp/backups/db/database.sql.gz | grep 4fded1464736e77865df232cbcb4cd19
- The text does not show up, so the snapshot works correctly: data inserted during backup is not included in the database dump
Test the backup and restore procedure
This procedure was tested with
- Multi database setup
- Single database setup
Test procedure (use 'main' database for single database setup):
- Insert some data in ci db. This database is empty on development, so add some data to it:
gdk psql -d gitlabhq_development_ci
INSERT INTO shards (name) VALUES('This should be restored');
- Check number of projects in main db:
gdk psql
-
SELECT COUNT(*) FROM projects;
(for example 8)
- Create backup
SKIP=tar bundle exec rake gitlab:backup:create
- insert some data in both databases, these rows should be gone after restore:
gdk psql
INSERT INTO shards (name) VALUES(current_database());
\connect gitlabhq_development_ci
INSERT INTO shards (name) VALUES(current_database());
- Run restore
bundle exec rake gitlab:backup:db:restore
- Now verify:
gdk psql
- Main database has just one record (name=default) in this table:
SELECT * FROM shards;
- Main database still has correct number of projects:
select count(*) from projects;
\connect gitlabhq_development_ci
- CI DB still has the experiment record that was added before creating the backup:
select * from shards
yields 1 record
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.