Runaway backup size with large numbers of community forks/merge requests may prevent a restore in a reasonable amount of disk space.
Summary
After enabling forks and merge requests on all 30,000+ projects within the Drupal GitLab instance, we've noticed that our backups grow by a full GB each day. After digging into the backup contents and our logs, we've concluded that this is not only due to the volume of activity: the backups don't seem to support the same Git object deduplication that Gitaly provides in a live environment.
In essence, the backups keep growing and are much larger than the actual disk use for merge requests in production, because objects are deduplicated in production but not in the backup.
More urgently, because the Git references that enable deduplication are lost when making backups, restoring a backup would take up much more disk space than production does today, possibly more than we have available.
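To illustrate why the two sizes diverge, here is a rough back-of-the-envelope model in Python. It assumes a simplified picture in which every repository in a fork network shares one copy of the common objects in production, while a backup stores each repository as a full, standalone copy. The function names and the numbers in the example are hypothetical and are not measurements from our instance.

```python
from typing import Sequence


def deduplicated_size(shared_mb: float, unique_mb_per_repo: Sequence[float]) -> float:
    """Approximate size when all repositories share one copy of the common objects."""
    return shared_mb + sum(unique_mb_per_repo)


def full_copy_size(shared_mb: float, unique_mb_per_repo: Sequence[float]) -> float:
    """Approximate size when every repository carries its own copy of the common objects."""
    return sum(shared_mb + unique for unique in unique_mb_per_repo)


if __name__ == "__main__":
    # Hypothetical fork network: 20 repositories sharing 500 MB of history,
    # each with 1 MB of objects of its own.
    repos = [1.0] * 20
    print("deduplicated:", deduplicated_size(500.0, repos), "MB")
    print("full copies: ", full_copy_size(500.0, repos), "MB")
```

Under these made-up numbers the deduplicated layout needs 520 MB while the full-copy layout needs 10,020 MB, which is the kind of gap we are worried about on restore.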
I've linked this in our Drupal migration issue, but wanted to make sure it reached your attention quickly.
Related to this epic and issue:
https://gitlab.com/groups/gitlab-org/-/epics/189
https://gitlab.com/gitlab-org/gitaly/-/issues/1355
"It would be better if all restored repositories are properly deduplicated during the restore, but in the interest of limiting scope I think we should cut that from the first iteration."
Steps to reproduce
- Examine a series of daily backups from a GitLab instance that uses Git object deduplication.
- Compare the actual size of the production GitLab database/filetree to the size of the backup for a given day.
- Note the divergence over time between the production size and the backup size, as deduplicated Git objects are dereferenced and then stored as full clones in the backup.
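The first and last steps above could be scripted roughly as follows. This is a minimal sketch, assuming the backup archives sit in one directory and that their names end in `_gitlab_backup.tar` with a timestamp prefix, so that sorting by name is chronological. The directory path and the helper names are assumptions for illustration.

```python
from pathlib import Path


def backup_sizes(backup_dir: str, pattern: str = "*_gitlab_backup.tar") -> list:
    """Return (file name, size in bytes) for each backup archive, sorted by name."""
    files = sorted(Path(backup_dir).glob(pattern))
    return [(p.name, p.stat().st_size) for p in files]


def growth_between_backups(sizes: list) -> list:
    """Return (file name, bytes gained compared with the previous archive)."""
    return [
        (name, size - prev_size)
        for (_, prev_size), (name, size) in zip(sizes, sizes[1:])
    ]


if __name__ == "__main__":
    # Assumed backup location; adjust to the instance's configured backup path.
    sizes = backup_sizes("/var/opt/gitlab/backups")
    for name, delta in growth_between_backups(sizes):
        print(f"{name}: {delta / 1024**3:+.2f} GiB")
```

Comparing these deltas with the growth of the production repository storage over the same period should show whether the backups are growing faster than the live data.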
What is the expected correct behavior?
Ideally, we would see the backups for the GitLab/Gitaly system managed in a way that preserves the Git ref system that allows for object deduplication between forks and their parent projects.
This could be:
- Simplifying backups to just .tar.gz copies of the database/filetree
- OR continuing to use the current git-object model that the backups provide, while making both the backup and restore processes preserve Git refs.
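One way to check whether deduplication links survive a backup and restore cycle is to look at each repository's `objects/info/alternates` file, which is Git's standard mechanism for borrowing objects from another object directory. The sketch below assumes that the deduplication between forks and their parent projects is recorded there; that is our understanding, not something we have confirmed. The helper name is hypothetical.

```python
import sys
from pathlib import Path


def read_alternates(repo_path: str) -> list:
    """Return the object directories a bare repository borrows objects from.

    Git records these, one path per line, in objects/info/alternates.
    An empty list means the repository stores all of its objects itself.
    """
    alternates_file = Path(repo_path) / "objects" / "info" / "alternates"
    if not alternates_file.is_file():
        return []
    lines = alternates_file.read_text().splitlines()
    # Skip blank lines and comment lines.
    return [line.strip() for line in lines if line.strip() and not line.startswith("#")]


if __name__ == "__main__":
    # Usage: pass one or more bare repository paths on the command line.
    for repo in sys.argv[1:]:
        links = read_alternates(repo)
        status = ", ".join(links) if links else "no alternates (standalone copy)"
        print(f"{repo}: {status}")
```

Running this against the same repositories before a backup and after a restore would show whether a fork that shared objects in production has come back as a standalone copy.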
Relevant logs and/or screenshots
- TBA if we can find a good example
Output of checks
- TBA if we can find a good example
Results of GitLab environment info
- TBA if we can find useful output - but we believe this is vanilla behavior.
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`)
(For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)