Add concurrency support for Git repository backups
Problem to solve
Backups protect against data loss and should be run regularly, as well as before maintenance and upgrades. For this reason, GitLab provides a gitlab-backup script so that system administrators can back up everything. As the number and size of repositories increase, backing up repository data using the script becomes prohibitively slow.
A significant reason why the script is slow is that repositories are backed up one by one. This underutilizes the available CPU and memory resources, particularly when there are multiple Gitaly nodes. Adding concurrency will allow multiple repositories to be backed up at the same time, reducing the time needed to take a backup.
Further details
Presuming sufficient CPU and memory, this could reduce backup time significantly.
Proposal
Add a new option, GIT_BUNDLE_CONCURRENCY, to allow multiple Git bundle commands to run concurrently.
- The default value, 0, would keep the current serialized behavior
- A value of 1...n would enable concurrency
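As a rough illustration of these semantics, the option could be read as follows. This is only a sketch: the function name `bundle_concurrency` is hypothetical, and rejecting negative values is an assumption that the proposal does not specify.

```python
import os


def bundle_concurrency(env=os.environ):
    """Return the requested concurrency; 0 selects the serialized behavior.

    Hypothetical helper: only the variable name GIT_BUNDLE_CONCURRENCY
    comes from the proposal.
    """
    raw = env.get("GIT_BUNDLE_CONCURRENCY", "0")  # default 0 = serialized
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        # Assumption: negative values are treated as a configuration error.
        raise ValueError("GIT_BUNDLE_CONCURRENCY must be 0 or greater")
    return value
```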
When concurrency is enabled with a value of 4, this would mean:
- each Gitaly storage would allow a maximum of 4 repos to be bundled at the same time
- in a typical single storage/shard configuration this would mean 4 bundles being generated in parallel
- in a 3 shard configuration this would mean 12 (= 4 concurrency x 3 shards) bundles being generated in parallel
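The per-storage limit described above can be sketched with one worker pool per storage. This is a minimal illustration, not the backup script's actual implementation: `backup_all`, `bundle_repo`, and `max_parallel_bundles` are hypothetical names, and `bundle_repo` stands in for whatever step creates the bundle for a single repository.

```python
from concurrent.futures import ThreadPoolExecutor


def backup_all(repos_by_storage, concurrency, bundle_repo):
    """Bundle every repository, allowing `concurrency` bundles per storage.

    repos_by_storage: dict mapping storage name -> list of repositories
    bundle_repo: callable(storage, repo), a hypothetical stand-in for the
                 real bundle step
    """
    if concurrency == 0:
        # Serialized behavior: one repository at a time.
        for storage, repos in repos_by_storage.items():
            for repo in repos:
                bundle_repo(storage, repo)
        return

    # One pool per storage, so the limit applies per storage, not globally.
    pools = {
        storage: ThreadPoolExecutor(max_workers=concurrency)
        for storage in repos_by_storage
    }
    try:
        futures = [
            pools[storage].submit(bundle_repo, storage, repo)
            for storage, repos in repos_by_storage.items()
            for repo in repos
        ]
        for future in futures:
            future.result()  # re-raises any exception from a worker
    finally:
        for pool in pools.values():
            pool.shutdown(wait=True)


def max_parallel_bundles(storage_count, concurrency):
    """Upper bound on simultaneous bundles; serialized mode runs one at a time."""
    return 1 if concurrency == 0 else storage_count * concurrency
```

With these definitions, `max_parallel_bundles(1, 4)` gives 4 and `max_parallel_bundles(3, 4)` gives 12, matching the single-storage and 3-shard cases listed above. Using a separate pool per storage, rather than one global pool, is what keeps a busy storage from consuming the slots intended for the others.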