Regular backups taken of gitlab-org/gitlab on production
Tracks progress for https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/5227
Context
We do not currently use Gitaly's repository backup solution on gitlab.com because it has scalability issues on large GitLab instances. As mentioned in the documentation:
The backup command produces a Git bundle for each repository and tars them all up. This duplicates pool repository data into every fork. In our testing, 100 GB of Git repositories took a little over 2 hours to back up and upload to S3. At around 400 GB of Git data, the backup command is likely not viable for regular backups. For more information, see alternative backup strategies.
gitlab.com holds far more Git data than that, so backing it up via this mechanism is untenable.
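As a rough illustration of the figures quoted above, the sketch below extrapolates backup duration from the documented data point (100 GB in a little over 2 hours, rounded down to 2 here). Linear scaling is an assumption made for illustration only; the function name and parameters are hypothetical, and real throughput will depend on fork duplication, network, and storage.

```python
def estimate_backup_hours(size_gb, ref_size_gb=100.0, ref_hours=2.0):
    """Estimate backup duration by scaling a reference measurement.

    Assumes duration grows linearly with repository data size, which is
    a simplification and not something the documentation states.
    """
    return size_gb / ref_size_gb * ref_hours


if __name__ == "__main__":
    # The size at which the documentation considers regular backups
    # likely not viable.
    print(estimate_backup_hours(400))  # 8.0 under the linear assumption
```

Under this assumption, the 400 GB threshold already corresponds to roughly a full working day per backup run.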
Recently, server-side repository backups have been implemented. These stream repository bundles directly to object storage, avoiding a round-trip to the backup originator. This mechanism should eliminate the scalability limitation and allow repositories in arbitrarily sized GitLab instances to be backed up efficiently.
Additionally, server-side backups can be used in conjunction with incremental backups to further improve efficiency.
Proposal
In order to dogfood incremental server-side backups, we should enable the functionality on gitlab.com for the gitlab-org/gitlab project. Once the dashboards are merged, we'll also be able to monitor the performance of the backups.
To set this up, we need to provision:
- Object storage bucket
- A cronjob or similar to execute a direct repository backup
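One way the scheduled job could assemble its invocation is sketched below. It assumes the backup is triggered through the `gitlab-backup create` wrapper using the documented `REPOSITORIES_SERVER_SIDE`, `REPOSITORIES_PATHS`, `INCREMENTAL`, `PREVIOUS_BACKUP`, and `SKIP` options; the actual mechanism on gitlab.com may well differ (for example, calling Gitaly directly). The helper name `build_backup_command` and the backup ID shown are hypothetical.

```python
import shlex


def build_backup_command(project_path, previous_backup=None, skip=()):
    """Build an argument list for a server-side repository backup.

    Assumes the `gitlab-backup create` wrapper is the entry point;
    this is an illustration, not the confirmed production setup.
    """
    args = [
        "gitlab-backup",
        "create",
        # Ask Gitaly to stream bundles straight to object storage.
        "REPOSITORIES_SERVER_SIDE=true",
        # Restrict the backup to a single project.
        f"REPOSITORIES_PATHS={project_path}",
    ]
    if previous_backup is not None:
        # Build an increment on top of an earlier backup.
        args += ["INCREMENTAL=yes", f"PREVIOUS_BACKUP={previous_backup}"]
    if skip:
        # Components to leave out, supplied by the caller.
        args.append("SKIP=" + ",".join(skip))
    return args


if __name__ == "__main__":
    # "example-backup-id" is a placeholder, not a real backup ID.
    cmd = build_backup_command(
        "gitlab-org/gitlab",
        previous_backup="example-backup-id",
        skip=("db", "uploads"),
    )
    # Print the command instead of running it, so the sketch is safe
    # to execute anywhere.
    print(shlex.join(cmd))
```

A cron entry or Kubernetes CronJob would then run the resulting command on whatever schedule is chosen, with the Gitaly nodes configured to write to the provisioned bucket.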