Add concurrency support for Git repository backups
Problem to solve
Backups protect against data loss and should be run regularly, as well as before maintenance and upgrades. For this reason, GitLab provides a gitlab-backup script so that system administrators can back up everything. As the number and size of repositories increase, backing up repository data with the script becomes prohibitively slow.
A significant reason why the script is slow is that repositories are backed up one by one. This underutilizes the available CPU and memory resources, particularly when there are multiple Gitaly nodes. Adding concurrency will allow multiple repositories to be backed up at the same time, significantly reducing the time it takes to complete a backup.
Further details
Presuming sufficient CPU and memory, this could reduce backup time significantly.
Proposal
Add a new option, GIT_BUNDLE_CONCURRENCY, to allow Git bundle commands to run concurrently.
- The default value 0 would keep the current serialized behavior
- A value of 1...n would enable concurrency
When concurrency is enabled with a value of 4, this would mean:
- each Gitaly storage would allow a maximum of 4 repos to be bundled at the same time
- in a typical single storage/shard configuration this would mean 4 bundles being generated in parallel
- in a 3 shard configuration this would mean 12 (= 4 concurrency x 3 shards) bundles being generated in parallel
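The per-storage limit described above can be sketched as follows. This is only an illustration of the proposed semantics, not how gitlab-backup is or would be implemented: the function names (parse_concurrency, max_parallel_bundles, backup_all) and the bundle_repo callback are hypothetical, and the sketch assumes one worker pool per Gitaly storage.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack


def parse_concurrency(env=None):
    """Read the proposed GIT_BUNDLE_CONCURRENCY option (default 0 = serialized)."""
    env = os.environ if env is None else env
    value = int(env.get("GIT_BUNDLE_CONCURRENCY", "0"))
    if value < 0:
        raise ValueError("GIT_BUNDLE_CONCURRENCY must be 0 or greater")
    return value


def max_parallel_bundles(concurrency, storages):
    """Upper bound on bundles generated at the same time."""
    if concurrency == 0:
        return 1  # serialized: one repository at a time overall
    return concurrency * storages


def backup_all(repos_by_storage, bundle_repo, concurrency):
    """Bundle every repository, with at most `concurrency` at a time per storage.

    repos_by_storage maps a storage name to its list of repositories;
    bundle_repo(storage, repo) is a hypothetical callback that creates one bundle.
    """
    if concurrency == 0:
        # Current behavior: repositories are backed up one by one.
        return [
            bundle_repo(storage, repo)
            for storage, repos in repos_by_storage.items()
            for repo in repos
        ]

    with ExitStack() as stack:
        futures = []
        for storage, repos in repos_by_storage.items():
            # One pool per storage, so the limit applies to each shard separately.
            pool = stack.enter_context(ThreadPoolExecutor(max_workers=concurrency))
            futures.extend(pool.submit(bundle_repo, storage, repo) for repo in repos)
        # Results come back in submission order; exceptions re-raise here.
        return [future.result() for future in futures]
```

With a value of 4, max_parallel_bundles(4, 1) gives 4 and max_parallel_bundles(4, 3) gives 12, matching the single-storage and 3-shard cases above.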