Unified Backups: Investigate disk-snapshotting as an alternative to STRATEGY=copy


Context

When you run a backup against a very busy installation, you may hit a race condition between files being created/removed and a backup that has already started.

In the past we've provided a workaround in the form of Backup Strategy options: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#backup-strategy-option

While we are revisiting our backup solution, I think we can provide a better alternative.

The current default solution works fine, even in larger installations, as long as nothing is changing existing files.

This can be achieved by restricting traffic to the servers, or by enabling Maintenance mode.

The existing workaround, STRATEGY=copy, has a significant downside: it requires at least twice the storage to be available (so the files can be duplicated before being added to the backup archive).

Possible alternatives to investigate

The current limitation seems to be due to piping a list of files to a tar + compression command. If any of those files no longer exists by the time tar reads it, or is still being modified/written to, the whole operation fails.

We should investigate whether we could process the list of files from Ruby code in chunks and "append to a tar file", in a way that lets us recover from a failure and retry. (We need to verify whether appending with compression is possible or not.)
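
One caveat worth noting up front: GNU tar refuses to append (`-r`) to a compressed archive, so in this approach compression would likely have to run as a final pass over the finished tar file. A minimal Ruby sketch of the chunked append-and-skip idea (the chunk size, error handling, and helper name are illustrative only):

```ruby
require "tmpdir"

# Sketch: append files to an *uncompressed* tar archive in chunks,
# tolerating files that vanish between listing and archiving.
# GNU tar's -r (append) mode cannot be used on compressed archives,
# so gzip/zstd would have to run as a final pass.
def chunked_tar_append(files, archive, chunk_size: 100)
  files.each_slice(chunk_size) do |chunk|
    # Drop files deleted since the listing was taken; a real
    # implementation could also retry a failed chunk here.
    existing = chunk.select { |f| File.exist?(f) }
    next if existing.empty?

    flag = File.exist?(archive) ? "-rf" : "-cf" # append, or create on the first chunk
    system("tar", flag, archive, *existing) or raise "tar failed on chunk"
  end
end
```

A failed chunk only loses that chunk's worth of work, so wrapping each `tar` invocation in a retry loop becomes feasible, which is not the case when a single `tar` process reads the whole file list from a pipe.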

As an alternative to append-and-retry, we could limit the duplication to chunks of files (or to a certain buffer size) so that we don't waste too much disk space.
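
A sketch of that bounded-duplication idea: stage one chunk at a time, so the extra disk usage is capped by the chunk size rather than the full data set. (The flat `File.basename` staging below ignores directory structure, which a real implementation would need to preserve; names and sizes are placeholders.)

```ruby
require "tmpdir"
require "fileutils"

# Sketch: bound the extra disk usage of STRATEGY=copy by staging only
# one chunk of files at a time. Each chunk is copied to a scratch
# directory (so tar reads stable copies), appended to the archive,
# then deleted before the next chunk is staged.
def staged_chunk_backup(files, archive, staging_dir, chunk_size: 100)
  files.each_slice(chunk_size) do |chunk|
    staged = chunk.filter_map do |f|
      next unless File.exist?(f) # tolerate files vanishing
      dest = File.join(staging_dir, File.basename(f)) # simplification: flattens paths
      FileUtils.cp(f, dest)
      dest
    end
    next if staged.empty?

    flag = File.exist?(archive) ? "-rf" : "-cf"
    system("tar", flag, archive, "-C", staging_dir,
           *staged.map { |p| File.basename(p) }) or raise "tar failed"
    FileUtils.rm_f(staged) # free the staging space before the next chunk
  end
end
```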

If tar doesn't provide this type of capability/atomicity, we could look for another container/archive format that does.

We could consider whether a "remote" approach relying on an external server would work better (rsync to a temporary server/storage, then run tar/compression from there).
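
Roughly, that flow could look like the two commands below (host names and paths are placeholders, and the helper is purely illustrative):

```ruby
# Sketch of the "remote" approach: sync the data to a temporary
# storage server, then build the archive there, so the busy
# production host never has to hold a second copy of the data.
def remote_backup_commands(src_dir, host, remote_dir, archive)
  [
    # 1. Duplicate the data onto the temporary server. Re-running
    #    rsync converges on a consistent copy of files that were
    #    mid-write during an earlier pass.
    ["rsync", "-a", "--delete", "#{src_dir}/", "#{host}:#{remote_dir}/"],
    # 2. Archive on the remote side, where the data is no longer changing.
    ["ssh", host, "tar -czf #{archive} -C #{remote_dir} ."],
  ]
end
```

Running a second quick rsync pass immediately before the archiving step would narrow the inconsistency window further.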

We could consider supporting an LVM + snapshot approach natively, as documented here: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#alternative-backup-strategies and making that REQUIRED/RECOMMENDED for larger installations
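
Expressed as the command sequence a native implementation could drive, the LVM flow from the linked docs looks roughly like this (the volume group/LV names, snapshot size, and mount point are illustrative placeholders):

```ruby
# Sketch of an LVM snapshot-based backup as an ordered command list.
def lvm_snapshot_backup_commands(vg: "gitlab-vg", lv: "data", mnt: "/mnt/gitlab-snap")
  [
    # Copy-on-write snapshot: only blocks changed after this point consume
    # the 10G buffer, so we don't need the 2x storage of STRATEGY=copy.
    ["lvcreate", "--snapshot", "--size", "10G", "--name", "#{lv}-snap", "#{vg}/#{lv}"],
    ["mount", "-o", "ro", "/dev/#{vg}/#{lv}-snap", mnt],
    # Archive from the frozen snapshot while the live volume keeps changing.
    ["tar", "-czf", "/backups/gitlab-data.tar.gz", "-C", mnt, "."],
    ["umount", mnt],
    ["lvremove", "--force", "#{vg}/#{lv}-snap"],
  ]
end
```

The key property is that the snapshot's extra storage cost is proportional to the writes that happen during the backup, not to the total data size.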

We could consider supporting other forms of filesystem snapshotting (similar to the LVM approach), including btrfs and/or ZFS, and making that a REQUIRED/RECOMMENDED approach for larger installations

We could consider a hybrid approach: rely on an external storage server that supports some form of snapshotting, rsync data from the existing server to it (keeping the data duplicated there), and take a snapshot before running each backup.

We could consider the same snapshotting strategy, but running from a Geo secondary site. This could give us the possibility of "point in time" integrity:

  • We would pause database replication at the time of backup
  • We would wait for replication to finish (so that stored blobs match the state of the database)
  • Then we would start a database dump + filesystem snapshot
  • After the snapshot is completed, database replication can resume, and we initiate the backup from the mounted snapshot
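
The steps above could be sketched as an ordered command sequence on the secondary. `gitlab-ctl geo-replication-pause`/`-resume` are the existing planned-failover commands; the SKIP list, volume names, and mount point are placeholders:

```ruby
# Sketch of a point-in-time backup driven from a Geo secondary.
def geo_snapshot_backup_commands
  [
    # 1. Pause database replication so the DB state stops moving.
    ["gitlab-ctl", "geo-replication-pause"],
    # 2. Check that blob replication has caught up with the paused DB
    #    state (a real implementation would poll this until synced).
    ["gitlab-rake", "geo:status"],
    # 3. Dump the database and snapshot the filesystem at the same point
    #    (SKIP list abbreviated for illustration).
    ["gitlab-backup", "create", "SKIP=repositories,uploads"],
    ["lvcreate", "--snapshot", "--size", "10G", "--name", "data-snap", "gitlab-vg/data"],
    # 4. Resume replication immediately; the slow archiving step reads
    #    from the mounted snapshot, not the live volume.
    ["gitlab-ctl", "geo-replication-resume"],
    ["mount", "-o", "ro", "/dev/gitlab-vg/data-snap", "/mnt/geo-snap"],
    ["tar", "-czf", "/backups/geo-data.tar.gz", "-C", "/mnt/geo-snap", "."],
    ["umount", "/mnt/geo-snap"],
    ["lvremove", "--force", "gitlab-vg/data-snap"],
  ]
end
```

Because replication resumes as soon as the snapshot exists, the secondary only lags the primary for the short pause/dump/snapshot window, not for the full duration of the archive step.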