how/when are server-side backups pruned?

Problem to solve

Gitaly seems to support sending repository backups to an object storage server, so-called "server-side backups".

In my first experiments with the system, the first backup took a whopping 200GiB of disk space. Then the second backup came, and again took another 200GiB of disk space, which was quite alarming.

I have since then added INCREMENTAL=yes to the backup job, but all of this is, frankly, seriously under-documented, and not clear at all. Even the fact that server-side backups now support incremental backups doesn't seem to be documented anywhere.

I could live with all this if I knew my data was safe and that my disks won't explode over the holidays. I have been living on thin hope with GitLab backups for years now, so I guess the former is business as usual, but I would surely prefer not to have an out-of-disk crisis while I'm trying to avoid eating dead animals by accident, for one. ;)

Since 16.6, there's some more documentation on how backups work, including an interesting activity diagram of how server-side backups work, but it doesn't tell me what I need to know about expiry.

Further details

The use case is a single-node GitLab self-hosted instance that has crossed the 1k user mark but has so far failed to make the transition to the proper 2k reference architecture. Specifically, this is about https://gitlab.torproject.org/ which is starting to host a rather large number of Git repositories, especially multiple forks of Firefox (Tor Browser and Mullvad Browser).

Interestingly, our current on-disk usage in /var/opt/gitlab/git-data/repositories is 90.8GiB, including 63.8GiB @hashed and 27.3GiB @pools. So there's already a huge amplification in the backup space usage (~2x). I could also live with that, by the way.
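
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes every backup is a full copy of about 200GiB and that nothing is ever pruned, which is exactly the part I can't confirm. The function names and the backup counts in the loop are made up for illustration; only the 90.8GiB and 200GiB figures come from this issue.

```python
# Figures quoted in this issue, in GiB.
ON_DISK_GIB = 90.8        # size of /var/opt/gitlab/git-data/repositories
FULL_BACKUP_GIB = 200.0   # approximate size of one full server-side backup


def amplification(backup_gib: float, on_disk_gib: float) -> float:
    """Ratio of the space one backup takes to the on-disk repository size."""
    return backup_gib / on_disk_gib


def unpruned_usage(full_backup_gib: float, backups: int) -> float:
    """Object storage used after `backups` full backups if nothing expires.

    Assumption: no incremental backups and no retention policy at all.
    """
    return full_backup_gib * backups


if __name__ == "__main__":
    ratio = amplification(FULL_BACKUP_GIB, ON_DISK_GIB)
    print(f"amplification: ~{ratio:.1f}x")
    # Hypothetical backup counts, e.g. one full backup per day.
    for count in (1, 2, 7, 30):
        used = unpruned_usage(FULL_BACKUP_GIB, count)
        print(f"{count:>2} full backups: {used:.0f} GiB")
```

With INCREMENTAL=yes the growth should be much slower than this worst case, but without a known retention policy it is still unbounded.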

I guess I'm Sidney in that little play of yours. ;)

Proposal

So there's a bunch of questions I'd like to have answered here:

  1. How is data expired from object storage when it's used by Gitaly to perform server-side backups?
  2. Am I supposed to implement object expiration policies on the object storage server?
  3. Or is that somewhat automatically taken care of by Gitaly?
  4. If so, what is the retention policy?

Who can address the issue

I guess the Gitaly folks would be the best to answer this. Considering they literally wrote the book on backups, I guess that @proglottis would be a good bet, but they also seem very busy, so maybe some technical writer would be more appropriate. Not sure.

Other links/references

There's a handful of similar issues about Gitaly documentation, backups, and specifically server-side backups.

  • #431454 - asks if server-side backups take up space on the Gitaly server (the answer is "no", but is not actually documented in the manual)
  • gitaly#5691 (closed) - is a more generic "document this thing better" question, specifically "repository backups always need the gitlab-rails DB, where data is stored/ transmitted and praefect concerns"
  • !138129 (merged) - is an attempt at answering part of the above
  • gitaly!6475 (merged) - seems to have implemented support for incremental server-side backups, but has not updated documentation accordingly

@gitlab-bot label documentation Category:Gitaly groupgitaly docs-missing