Disk usage growing rapidly
Support Request for the Gitaly Team
This request template is part of Gitaly Team's intake process.
Customer Information
Salesforce Link:
Installation Size:
Architecture Information:
Slack Channel:
Additional Information:
Support Request
Severity
In one to two weeks, (one of) our GitLab servers will run out of disk space and become unusable. That points towards the priority being "Urgent" (or at least "High"), but a workaround might exist that would allow us to keep GitLab running (although implementing it might be difficult).
Problem Description
Since the night leading into 6 May, approximately 170 GiB of additional disk space has been used every night.
Troubleshooting Performed
The graph where I could see the disk usage growing suggested that the backup task wasn't doing its normal cleanup, and a simple check of disk usage on the machine hosting that GitLab server showed that most of the space was used in `/var/opt/gitlab/backups`. Looking closer at that directory shows that most of the usage is in `/var/opt/gitlab/backups/repositories/@hashed`, where it is distributed between projects.
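For reference, the commands used to narrow this down looked roughly like the following (illustrative, not an exact transcript; the paths are the ones described above):

```shell
# Per-directory usage under the GitLab data directory; /var/opt/gitlab/backups
# stood out as the largest consumer.
sudo du -xh --max-depth=1 /var/opt/gitlab | sort -rh | head

# Drilling into the backup directory shows the bulk sitting under
# repositories/@hashed, spread across projects.
sudo du -xh --max-depth=2 /var/opt/gitlab/backups | sort -rh | head
```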
Looking at one of our larger projects, the web UI ("Admin Area" -> "Projects" -> "") shows the following for storage:
Storage: 230.6 GB (Repository: 42.7 MB / Wikis: 0 Bytes / Build Artifacts: 230.5 GB / Pipeline Artifacts: 0 Bytes / LFS: 0 Bytes / Snippets: 0 Bytes / Packages: 0 Bytes / Uploads: 54.2 MB)
(the build artifacts are on object storage, and while that number seems a bit excessive, we're fine with it), but `du -hc /var/opt/gitlab/backups/repositories/@hashed/3c/36/3c365ff931ecb0e3c0f00231793fe32151463bfcc31a4fdf4eb0a5942f5b1ddb` (I've found that hash in the admin, on the page mentioned above) shows the usage of that directory to be 32 GiB.
The mentioned directory contains a directory for each day since 6 May (both the names and the timestamps indicate that), each of roughly the same size (around 2.4 GiB).
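For cross-checking, and assuming the instance uses hashed storage (where the hash is the SHA-256 of the decimal project ID), the directory name can also be derived from the project ID shown in the Admin Area; the ID below is a placeholder:

```shell
# Hypothetical project ID; substitute the real one from the Admin Area.
PROJECT_ID=1234
HASH=$(printf '%s' "$PROJECT_ID" | sha256sum | awk '{print $1}')
# The backup copies of the repository use the same @hashed layout.
sudo du -sh "/var/opt/gitlab/backups/repositories/@hashed/${HASH:0:2}/${HASH:2:2}/${HASH}"
```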
6 May might be the first backup done after I upgraded that server to 14.10 (I had been on vacation). Combined with the discovery that the increased usage is due to backups, I suspect this is caused by the incremental backup feature (merge request: !3937 (merged)); that sounds like a good thing, but it is not something we've needed yet. What we do use (in fact I implemented it) is `SKIP=tar`. Our backup procedure is based on a systemd timer running `/usr/bin/gitlab-backup create CRON=1 SKIP=tar,registry STRATEGY=copy` (plus a daemon copying `/var/opt/gitlab/backups` to another server afterwards; that server must be seeing the consequences of this too), and GitLab is configured to remove backups older than 23 hours (with `gitlab_rails['backup_keep_time'] = 82800` in `gitlab.rb`).
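For completeness, the moving parts of that procedure condensed into one place (unit names and the exact schedule are omitted; nothing here goes beyond what is described above):

```shell
# What the nightly systemd service effectively executes:
/usr/bin/gitlab-backup create CRON=1 SKIP=tar,registry STRATEGY=copy

# Retention window, set in /etc/gitlab/gitlab.rb (82800 s = 23 hours)
# and applied with `sudo gitlab-ctl reconfigure`:
#   gitlab_rails['backup_keep_time'] = 82800
```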
As the process overwrites old backups anyway, deletion of old backups might be disabled by `SKIP=tar` (it's been over two years since I set that up, so I don't remember all the details anymore). If the deletion of those files was added onto (or is based on) that mechanism, it would explain what I see. In that case: is it safe to delete those files? That might serve as a workaround until compatibility between incremental backups and `SKIP=tar` has been restored.
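As a first step towards that workaround, and only as a sketch under the assumption that the per-day directories really are leftover incremental backup data, one could list the candidates before deleting anything (the window below is deliberately more conservative than the 23-hour `backup_keep_time`):

```shell
# Dry run: list per-day repository backup directories at least two days old.
# Layout assumed from the observations above:
#   .../@hashed/<2 chars>/<2 chars>/<full hash>/<per-day directory>
# Nothing is deleted here.
find /var/opt/gitlab/backups/repositories/@hashed \
    -mindepth 4 -maxdepth 4 -type d -mtime +1 -print

# Only after the Gitaly team confirms these files are safe to remove:
# find /var/opt/gitlab/backups/repositories/@hashed \
#     -mindepth 4 -maxdepth 4 -type d -mtime +1 -exec rm -rf {} +
```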
We also (for security reasons) have a second GitLab server, with much less data, that shows a similar behaviour (but it has a lot more disk, so it won't run out as soon).
What specifically do you need from the Gitaly team
- A fix to the code so that disk usage doesn't continue to grow.
- If the problem is the compatibility issue described above: a clear answer on whether the suggested workaround will work.
Author Checklist
- Customer information provided
- Severity realistically set
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section