Unified Backups: Investigate an alternative to STRATEGY=copy
Context
When you run a backup against a very busy installation, you may experience a race condition between files being created/removed and a backup that has already started.
In the past we've provided a workaround in the form of Backup Strategy options: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#backup-strategy-option
While we are revisiting our backup solution, I think we can provide a better alternative.
The current default solution works fine, even in larger installations, as long as nothing is changing existing files.
This can be achieved by restricting traffic to the servers, or by enabling Maintenance Mode.
The existing workaround, using STRATEGY=copy, has a large downside: it requires at least twice the storage to be available (in order to duplicate the files before adding them to the backup archive).
Possible alternatives to investigate
The current limitation seems to be due to piping a list of files to a tar + compression command. If any of those files no longer exists by the time tar reads it, or any of the files is still being modified/written to, the operation fails.
We should investigate whether we could process the list of files from Ruby code in chunks and append them to a tar file, in a way that lets us recover from failure and retry. (We need to verify whether appending with compression is possible or not.)
As an alternative to append-and-retry, we could limit the duplication to chunks of files (or to a certain buffer size) so that we don't waste too much disk space.
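The append-and-retry idea could look roughly like the sketch below. This is a hypothetical illustration (the method name and structure are ours, not existing GitLab code): it appends files to an uncompressed tar archive in chunks, skipping files deleted since the list was built and retrying a failed chunk once. Compression would have to happen as a separate final pass, since `tar --append` does not work on compressed archives.

```ruby
require 'open3'

# Hypothetical sketch: append files to an uncompressed tar archive in
# chunks, recovering from files that vanish or change mid-read.
def archive_in_chunks(files, archive:, chunk_size: 100, retries: 1)
  # Start from an empty archive so every chunk can use --append.
  system('tar', '-cf', archive, '--files-from', '/dev/null') unless File.exist?(archive)

  files.each_slice(chunk_size) do |chunk|
    attempts = 0
    begin
      attempts += 1
      # Skip files that were deleted after the file list was built.
      existing = chunk.select { |f| File.exist?(f) }
      next if existing.empty?

      _out, err, status = Open3.capture3('tar', '--append', '-f', archive, *existing)
      raise "tar failed: #{err}" unless status.success?
    rescue RuntimeError
      retry if attempts <= retries
      raise
    end
  end
end
```

Because each chunk is an independent `tar --append` invocation, a failed chunk only needs that chunk re-read, not the whole backup restarted.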
We could find another container solution that provides this type of capability/atomicity if tar doesn't:
- https://stackoverflow.com/questions/24224134/alternative-to-tar-for-directories-with-changing-files suggests some options (this does not solve the compression part). While pax exists as an alternative to tar, it has the same limitation of preventing updates while the compressed file is being created, so we can't implement an "append" approach with it.
- Could be something supported by p7zip https://wiki.archlinux.org/title/p7zip (they have a disclaimer that you should not use it for backup purposes because it doesn't preserve the user/group, but in our case this is not a concern, as we have a predefined one already, and the data should not contain files with different users or groups inside it).
- The .7z file format seems to fit the bill here and we should investigate it (you could create the final compressed file and append each step to it, reducing the overall disk usage).
- zpaq (https://peazip.github.io/paq-file-format.html) seems to be another alternative with similar characteristics.
- https://peazip.github.io/archive-file-formats-comparison.html has a good list of archiving formats.
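For the .7z option specifically, the `a` (add) command updates an existing archive in place, so a chunked append workflow is at least mechanically possible. An illustrative sketch only (assumes p7zip is installed; archive and directory names are placeholders):

```
7z a backup.7z chunk-01/   # first call creates backup.7z
7z a backup.7z chunk-02/   # later calls append/update the archive in place
7z l backup.7z             # list contents to verify both chunks landed
```

Whether this gives us the failure-recovery semantics we need (e.g. what state the archive is left in if a call is interrupted) is part of what would need investigating.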
We could consider whether a "remote" approach relying on an external server would work better (rsync to a temporary server/storage and run tar/compression from there).
We could consider natively supporting an LVM + snapshot approach, as documented here: https://docs.gitlab.com/ee/administration/backup_restore/backup_gitlab.html#alternative-backup-strategies, and making that REQUIRED/RECOMMENDED for larger installations.
We could consider supporting other forms of filesystem snapshotting (similar to the LVM approach), including btrfs and/or zfs, and making that a REQUIRED/RECOMMENDED approach for larger installations.
We could consider a hybrid approach: rely on an external storage server that includes some form of snapshotting, rsync from the existing server to the external one (keeping the data duplicated there), and take a snapshot there before running each backup.
We could consider the same snapshotting strategy, but running from a Geo secondary site. This could give us the possibility of "point in time" integrity:
- We would pause database replication at the time of backup
- We would wait for replication to finish (so that stored blobs match the state of the database)
- Then we could start a database dump + filesystem snapshot
- After the snapshot is completed, database replication can resume, and we initiate the backup from the mounted snapshot
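On a Geo secondary, that sequence could look roughly like the following. This is a sketch only: it assumes an LVM-backed data volume, the volume and mount names are placeholders, and the exact commands and waiting mechanism would need verification.

```
gitlab-ctl geo-replication-pause           # 1. pause PostgreSQL replication
# ...wait for blob replication to catch up with the paused database state...
                                           # 2. database dump + filesystem snapshot
lvcreate --snapshot --name gitlab-snap \
         --size 10G /dev/vg0/gitlab-data
gitlab-ctl geo-replication-resume          # 3. replication resumes immediately
mount /dev/vg0/gitlab-snap /mnt/gitlab-snap
# 4. run tar/compression against /mnt/gitlab-snap, then drop the snapshot
```

The attraction is that the primary is never affected, and the window during which replication is paused is only as long as the snapshot takes, not the full backup.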