Add concurrency support for Git repository backups

Problem to solve

Backups protect against data loss and should be run regularly, as well as before maintenance and upgrades. For this reason, GitLab provides a gitlab-backup script so that system administrators can back up everything. As the number and size of repositories increase, backing up repository data with the script becomes prohibitively slow.

A significant reason the script is slow is that repositories are backed up one at a time. This underutilizes the available CPU and memory resources, particularly when there are multiple Gitaly nodes. Adding concurrency will allow multiple repositories to be backed up at the same time, significantly reducing the time needed to take a backup.

Further details

Presuming sufficient CPU and memory, backing up repositories concurrently could reduce backup time significantly.

Proposal

Add a new option, GIT_BUNDLE_CONCURRENCY, to allow concurrent Git bundle commands to run.

  • The default value 0 would keep the current serialized behavior
  • A value of 1...n would enable concurrency

When concurrency is enabled with a value of 4, this would mean:

  • each Gitaly storage would allow a maximum of 4 repos to be bundled at the same time
  • in a typical single storage/shard configuration this would mean 4 bundles being generated in parallel
  • in a 3 shard configuration this would mean 12 (= 4 concurrency x 3 shards) bundles being generated in parallel
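
The per-storage limit described above can be sketched as follows. This is only an illustration of the proposed semantics, not the gitlab-backup implementation: the names `max_parallel_bundles`, `backup_all`, and the `bundle` callback are hypothetical, and the sketch assumes one worker pool per Gitaly storage, each capped at the configured concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def max_parallel_bundles(concurrency: int, storages: int) -> int:
    """Upper bound on bundles generated at the same time (hypothetical helper)."""
    if concurrency == 0:
        # 0 keeps the current serialized behavior: one repository at a time.
        return 1
    return concurrency * storages


def backup_all(repos_by_storage, concurrency, bundle):
    """Run bundle(repo) for every repo, at most `concurrency` at once per storage.

    repos_by_storage maps a storage name to a list of repositories, and
    bundle is a stand-in for whatever runs the Git bundle command.
    """
    if concurrency == 0:
        # Serialized path: back up repositories one by one.
        for repos in repos_by_storage.values():
            for repo in repos:
                bundle(repo)
        return

    # One pool per storage, so the limit applies to each storage independently.
    pools = {
        storage: ThreadPoolExecutor(max_workers=concurrency)
        for storage in repos_by_storage
    }
    try:
        futures = [
            pools[storage].submit(bundle, repo)
            for storage, repos in repos_by_storage.items()
            for repo in repos
        ]
        for future in futures:
            future.result()  # re-raise any error from a failed bundle
    finally:
        for pool in pools.values():
            pool.shutdown(wait=True)


print(max_parallel_bundles(4, 1))  # single storage: 4
print(max_parallel_bundles(4, 3))  # three shards: 12
```

Keeping a separate pool per storage is what makes the total scale with the number of shards: a single shared pool of 4 workers would cap the whole backup at 4 bundles regardless of how many Gitaly nodes are available.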
Edited by James Ramsay (ex-GitLab)