Contradiction in Git repository backup documentation about the reliability of filesystem snapshots
Problem to solve
The backup instructions for Git(aly) repositories contradict themselves: the "Back up repository data separately" section and the "Snapshot backup and recovery limitations" section give opposite advice about snapshots.
The former says:
Back up repository data separately
First, ensure you back up existing GitLab data while skipping repositories:
sudo gitlab-backup create SKIP=repositories
For manually backing up the Git repository data on disk, there are multiple possible strategies:
- Use snapshots, such as the previous examples of Amazon EBS drive snapshots, or LVM snapshots + rsync.
The latter says:
Snapshot backup and recovery limitations
Gitaly Cluster does not support snapshot backups. Snapshot backups can cause issues where the Praefect database becomes out of sync with the disk storage. Because of how Praefect rebuilds the replication metadata of Gitaly disk information during a restore, you should use the official backup and restore Rake tasks.
The incremental backup method can be used to speed up Gitaly Cluster backups.
If you are unable to use either method, contact customer support for restoration help.
So I guess I'm doing the latter here: which one is it? Can we, or can we not, use (say) LVM snapshots to back up Gitaly repositories?
It seems rather wasteful and disruptive to constantly stop Gitaly to back up those precious repositories. If that's the case, I might as well set up a parallel Gitaly cluster (or just plain Git repositories) that regularly mirrors the repositories, since that can be done without shutting down the server(s).
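To make the "plain Git repositories" fallback concrete, here is a minimal sketch of what such a mirror job could look like. The function name `mirror_repo` and its arguments are hypothetical, and this is not something the GitLab documentation prescribes. It only copies the Git data reachable over the normal Git protocol, not the GitLab database or Praefect metadata.

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab): keep a bare mirror of one
# repository up to date without stopping the source server.
# Usage: mirror_repo <source-url-or-path> <destination-dir>
mirror_repo() {
    src=$1
    dest=$2
    if [ -d "$dest" ]; then
        # Existing mirror: fetch all refs and prune refs deleted upstream.
        git -C "$dest" remote update --prune
    else
        # First run: create a bare clone that mirrors every advertised ref.
        git clone --mirror "$src" "$dest"
    fi
}
```

Something like `mirror_repo git@gitlab.example.com:group/project.git /srv/mirrors/project.git` (both values are placeholders) could then run from cron for each repository.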
Further details
At the moment, we have removed repositories from the Rake backup task, assuming our regular backup system (Bacula) can properly back up Gitaly. We're therefore surprised and deeply concerned that the integrity of our backups might be at stake here.
Proposal
Clarify how to perform backups of Gitaly, either by removing or clarifying the reference to snapshots in both documents.
Who can address the issue
Unsure.
Other links/references
It looks like there was an attempt at fixing this in #385241 (closed), but it only added a warning on one of the two pages. There's also an entire epic about scaling backups (#28780), which touches on some of those issues. That epic at least links to the server-side repository backup approach, which seems to be the state of the art right now, as it can seemingly be performed without downtime.