Skip generating a Git bundle if an up-to-date bundle already exists in the previous backup
Problem to solve
When using sudo gitlab-backup create, backing up Git repositories is very slow. One reason for this is that all repositories are backed up, even if they haven't changed since the last backup.
Further details
On very large GitLab instances, it is likely that most projects are not updated every hour. Skipping repositories that have not changed allows more frequent backups without the high compute cost of bundling every repository each time.
Proposal
As an administrator, I will be able to run sudo gitlab-backup create and provide a path to a previous backup (e.g. PREVIOUS_BACKUP=path/to/last/backup).
If a path to a previous backup is provided, compare each repository's checksum on the server with its checksum in the previous backup:
- same checksum: reuse the Git bundle from the previous backup
- different checksum: generate a new Git bundle for the backup
This is a performance optimization that skips unnecessary bundle creation. It is not a storage optimization, and it does not change the storage format.
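The reuse-or-regenerate rule above can be sketched as follows. This is only an illustration in Python, not how gitlab-backup is implemented: the function names `bundle_action` and `plan_backup` are hypothetical, and the checksums are assumed to be opaque strings keyed by repository path. A repository that is missing from the previous backup is treated as changed.

```python
from typing import Dict, Mapping, Optional


def bundle_action(current_checksum: str, previous_checksum: Optional[str]) -> str:
    """Return "reuse" if the previous bundle is still current, else "generate"."""
    if previous_checksum is not None and previous_checksum == current_checksum:
        return "reuse"
    # Different checksum, or no bundle for this repository in the previous backup.
    return "generate"


def plan_backup(current: Mapping[str, str], previous: Mapping[str, str]) -> Dict[str, str]:
    """Map each repository on the server to the action the backup should take."""
    return {
        repo: bundle_action(checksum, previous.get(repo))
        for repo, checksum in current.items()
    }


if __name__ == "__main__":
    # Hypothetical repository paths and checksum values.
    server = {"group/a": "c1", "group/b": "c2-new", "group/c": "c3"}
    last_backup = {"group/a": "c1", "group/b": "c2-old"}
    for repo, action in plan_backup(server, last_backup).items():
        print(f"{repo}: {action}")
```

Only repositories marked "generate" would need a new Git bundle; the rest would be copied or linked from the previous backup.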
Checksum comparison
The checksum of a repo bundle must be compared to the current repository checksum. Possible approaches include:
- store the checksum with the backup in the filename
- store the checksum with the backup in a manifest file of some sort
- generate the checksum from the bundle on the fly
- store the checksum in the database
When evaluating these approaches we should consider:
- time efficiency to restore
- time efficiency to backup
- use standard Git bundle output (using a custom bundle format should be avoided)
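To make one of the options concrete, here is a minimal sketch of the manifest file approach, assuming a JSON file stored next to the bundles. The file name `repository_checksums.json` and both helper functions are hypothetical. The bundles themselves stay in standard Git bundle format, and a backup without a manifest is treated as having no known checksums, so every repository would be bundled again.

```python
import json
from pathlib import Path
from typing import Dict

# Hypothetical manifest name; nothing in the current backup layout defines it.
MANIFEST_NAME = "repository_checksums.json"


def write_manifest(backup_dir: Path, checksums: Dict[str, str]) -> Path:
    """Store repository checksums alongside the bundles in the backup directory."""
    manifest = backup_dir / MANIFEST_NAME
    manifest.write_text(json.dumps(checksums, indent=2, sort_keys=True))
    return manifest


def read_manifest(backup_dir: Path) -> Dict[str, str]:
    """Load checksums from a previous backup; return an empty dict if none exist."""
    manifest = backup_dir / MANIFEST_NAME
    if not manifest.is_file():
        return {}
    return json.loads(manifest.read_text())
```

Reading one small file up front keeps the comparison cheap at backup time and adds nothing to restore time, at the cost of one extra file to keep consistent with the bundles.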
Support for SKIP=tar
The default backup behavior is to generate a single large tarball.
It is probably reasonable to support this feature only when the previous backup was created with SKIP=tar. This avoids the need to untar the previous backup to read checksums and extract Git bundles of unchanged repositories.
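One way to enforce this restriction is to check the previous backup path before starting. The sketch below assumes that a backup created with SKIP=tar is an unpacked directory and rejects anything else. The function name `check_previous_backup` and the error messages are hypothetical.

```python
import tarfile
from pathlib import Path


def check_previous_backup(path: str) -> Path:
    """Return the previous backup directory, or raise if it cannot be reused."""
    backup = Path(path)
    if backup.is_dir():
        # Unpacked backup (as produced with SKIP=tar): bundles are directly readable.
        return backup
    if backup.is_file() and tarfile.is_tarfile(str(backup)):
        raise ValueError(
            f"{backup} is a tar archive; only backups created with SKIP=tar are supported"
        )
    raise ValueError(f"{backup} is not a backup directory")
```

Failing early keeps the behavior predictable: the backup either reuses bundles from a readable directory or stops before doing any work.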
Links / references
https://docs.gitlab.com/ee/raketasks/backup_restore.html#skipping-tar-creation