Unified Backups: Investigate disk-snapshotting as an alternative to STRATEGY=copy


Context

When you run a backup against a very busy installation, you may hit a race condition between files being created/removed and a backup that has already started.

In the past we've provided a workaround in the form of Backup Strategy options: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#backup-strategy-option

While we are revisiting our backup solution, I think we can provide a better alternative.

The current default solution works fine, even in larger installations, as long as nothing is changing existing files.

This can be achieved by restricting traffic to the servers, or by enabling Maintenance mode.

The existing workaround, STRATEGY=copy, has a significant downside: it requires at least twice the storage to be available (so the files can be duplicated before being added to the backup archive).

Possible alternatives to investigate

The current limitation seems to be due to piping a list of files to a tar + compression command. If any of those files no longer exists by the time tar reads it, or is still being modified/written to, the whole operation fails.

We should investigate whether we could process the list of files from Ruby code in chunks and "append to a tar file", in a way that lets us recover from a failure and retry. (We need to verify whether appending with compression is possible or not.)
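
One caveat worth noting up front: GNU tar refuses to append (`-r`) to a compressed archive, so in this approach compression would likely have to run as a final pass over the finished tar file. A minimal Ruby sketch of the chunked append-and-skip idea (the chunk size, error handling, and helper name are illustrative only):

```ruby
require "tmpdir"

# Sketch: append files to an *uncompressed* tar archive in chunks,
# tolerating files that vanish between listing and archiving.
# GNU tar's -r (append) mode cannot be used on compressed archives,
# so gzip/zstd would have to run as a final pass.
def chunked_tar_append(files, archive, chunk_size: 100)
  files.each_slice(chunk_size) do |chunk|
    # Drop files deleted since the listing was taken; a real
    # implementation could also retry a failed chunk here.
    existing = chunk.select { |f| File.exist?(f) }
    next if existing.empty?

    flag = File.exist?(archive) ? "-rf" : "-cf" # append, or create on the first chunk
    system("tar", flag, archive, *existing) or raise "tar failed on chunk"
  end
end
```

A failed chunk only loses that chunk's worth of work, so wrapping each `tar` invocation in a retry loop becomes feasible, which is not the case when a single `tar` process reads the whole file list from a pipe.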

As an alternative to append-and-retry, we could limit the duplication to chunks of files (or to a certain buffer size) so that we don't waste too much disk space.
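
A sketch of that bounded-duplication idea: stage one chunk at a time, so the extra disk usage is capped by the chunk size rather than the full data set. (The flat `File.basename` staging below ignores directory structure, which a real implementation would need to preserve; names and sizes are placeholders.)

```ruby
require "tmpdir"
require "fileutils"

# Sketch: bound the extra disk usage of STRATEGY=copy by staging only
# one chunk of files at a time. Each chunk is copied to a scratch
# directory (so tar reads stable copies), appended to the archive,
# then deleted before the next chunk is staged.
def staged_chunk_backup(files, archive, staging_dir, chunk_size: 100)
  files.each_slice(chunk_size) do |chunk|
    staged = chunk.filter_map do |f|
      next unless File.exist?(f) # tolerate files vanishing
      dest = File.join(staging_dir, File.basename(f)) # simplification: flattens paths
      FileUtils.cp(f, dest)
      dest
    end
    next if staged.empty?

    flag = File.exist?(archive) ? "-rf" : "-cf"
    system("tar", flag, archive, "-C", staging_dir,
           *staged.map { |p| File.basename(p) }) or raise "tar failed"
    FileUtils.rm_f(staged) # free the staging space before the next chunk
  end
end
```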

If tar doesn't provide this type of capability/atomicity, we could look for another container/archive format that does.

We could consider whether a "remote" approach relying on an external server would work better (rsync to a temporary server/storage, then run tar/compression from there).
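
Roughly, that flow could look like the two commands below (host names and paths are placeholders, and the helper is purely illustrative):

```ruby
# Sketch of the "remote" approach: sync the data to a temporary
# storage server, then build the archive there, so the busy
# production host never has to hold a second copy of the data.
def remote_backup_commands(src_dir, host, remote_dir, archive)
  [
    # 1. Duplicate the data onto the temporary server. Re-running
    #    rsync converges on a consistent copy of files that were
    #    mid-write during an earlier pass.
    ["rsync", "-a", "--delete", "#{src_dir}/", "#{host}:#{remote_dir}/"],
    # 2. Archive on the remote side, where the data is no longer changing.
    ["ssh", host, "tar -czf #{archive} -C #{remote_dir} ."],
  ]
end
```

Running a second quick rsync pass immediately before the archiving step would narrow the inconsistency window further.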

We could consider supporting an LVM + snapshot approach natively, as documented here: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#alternative-backup-strategies and making that REQUIRED/RECOMMENDED for larger installations
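
Expressed as the command sequence a native implementation could drive, the LVM flow from the linked docs looks roughly like this (the volume group/LV names, snapshot size, and mount point are illustrative placeholders):

```ruby
# Sketch of an LVM snapshot-based backup as an ordered command list.
def lvm_snapshot_backup_commands(vg: "gitlab-vg", lv: "data", mnt: "/mnt/gitlab-snap")
  [
    # Copy-on-write snapshot: only blocks changed after this point consume
    # the 10G buffer, so we don't need the 2x storage of STRATEGY=copy.
    ["lvcreate", "--snapshot", "--size", "10G", "--name", "#{lv}-snap", "#{vg}/#{lv}"],
    ["mount", "-o", "ro", "/dev/#{vg}/#{lv}-snap", mnt],
    # Archive from the frozen snapshot while the live volume keeps changing.
    ["tar", "-czf", "/backups/gitlab-data.tar.gz", "-C", mnt, "."],
    ["umount", mnt],
    ["lvremove", "--force", "#{vg}/#{lv}-snap"],
  ]
end
```

The key property is that the snapshot's extra storage cost is proportional to the writes that happen during the backup, not to the total data size.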

We could consider supporting other forms of filesystem snapshotting (similar to the LVM approach), including btrfs and/or ZFS, and making that a REQUIRED/RECOMMENDED approach for larger installations

We could consider a hybrid approach: rely on an external storage server that supports some form of snapshotting, rsync data from the existing server to it (keeping the data duplicated there), and take a snapshot before running each backup.

We could consider the same snapshotting strategy, but running from a Geo secondary site. This could give us the possibility of "point in time" integrity:

  • We would pause database replication at the time of backup
  • We would wait for replication to finish (so that stored blobs match the state of the database)
  • Then we would start a database dump + filesystem snapshot
  • After the snapshot is completed, database replication can resume, and we initiate the backup from the mounted snapshot
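
The steps above could be sketched as an ordered command sequence on the secondary. `gitlab-ctl geo-replication-pause`/`-resume` are the existing planned-failover commands; the SKIP list, volume names, and mount point are placeholders:

```ruby
# Sketch of a point-in-time backup driven from a Geo secondary.
def geo_snapshot_backup_commands
  [
    # 1. Pause database replication so the DB state stops moving.
    ["gitlab-ctl", "geo-replication-pause"],
    # 2. Check that blob replication has caught up with the paused DB
    #    state (a real implementation would poll this until synced).
    ["gitlab-rake", "geo:status"],
    # 3. Dump the database and snapshot the filesystem at the same point
    #    (SKIP list abbreviated for illustration).
    ["gitlab-backup", "create", "SKIP=repositories,uploads"],
    ["lvcreate", "--snapshot", "--size", "10G", "--name", "data-snap", "gitlab-vg/data"],
    # 4. Resume replication immediately; the slow archiving step reads
    #    from the mounted snapshot, not the live volume.
    ["gitlab-ctl", "geo-replication-resume"],
    ["mount", "-o", "ro", "/dev/gitlab-vg/data-snap", "/mnt/geo-snap"],
    ["tar", "-czf", "/backups/geo-data.tar.gz", "-C", "/mnt/geo-snap", "."],
    ["umount", "/mnt/geo-snap"],
    ["lvremove", "--force", "gitlab-vg/data-snap"],
  ]
end
```

Because replication resumes as soon as the snapshot exists, the secondary only lags the primary for the short pause/dump/snapshot window, not for the full duration of the archive step.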