Disk usage growing rapidly
Support Request for the Gitaly Team
This request template is part of Gitaly Team's intake process.
Customer Information
Salesforce Link:
Installation Size:
Architecture Information:
Slack Channel:
Additional Information:
Support Request
Severity
In one to two weeks, (one of) our GitLab servers will run out of disk space and become unusable. That points towards the priority being "Urgent" (or at least "High"), but a workaround might exist that would allow us to keep GitLab running (although implementing it might be difficult).
Problem Description
Since the night leading into 6 May, approximately 170 GiB of additional disk space has been used every night.
Troubleshooting Performed
The graph where I could see the disk usage growing suggested that the backup task wasn't doing its normal cleanup, and a simple check of disk usage on the machine hosting that GitLab server showed that most of the space was used in `/var/opt/gitlab/backups`. Looking closer at that directory shows that most of the usage is in `/var/opt/gitlab/backups/repositories/@hashed`, where it is distributed between projects.
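For reference, the commands used to narrow this down looked roughly like the following (illustrative, not an exact transcript; the paths are the ones described above):

```shell
# Per-directory usage under the GitLab data directory; /var/opt/gitlab/backups
# stood out as the largest consumer.
sudo du -xh --max-depth=1 /var/opt/gitlab | sort -rh | head

# Drilling into the backup directory shows the bulk sitting under
# repositories/@hashed, spread across projects.
sudo du -xh --max-depth=2 /var/opt/gitlab/backups | sort -rh | head
```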
Looking at one of our larger projects, the web UI ("Admin Area" -> "Projects" -> "") shows the following for storage:
Storage: 230.6 GB (Repository: 42.7 MB / Wikis: 0 Bytes / Build Artifacts: 230.5 GB / Pipeline Artifacts: 0 Bytes / LFS: 0 Bytes / Snippets: 0 Bytes / Packages: 0 Bytes / Uploads: 54.2 MB)
(the build artifacts are on object storage, and while that number seems a bit excessive, we're fine with it), but `du -hc /var/opt/gitlab/backups/repositories/@hashed/3c/36/3c365ff931ecb0e3c0f00231793fe32151463bfcc31a4fdf4eb0a5942f5b1ddb` (I've found that hash in the admin, on the page mentioned above) shows the usage of that directory to be 32 GiB.
The mentioned directory contains a directory for each day since 6 May (both the names and the timestamps indicate that), each of roughly the same size (around 2.4 GiB).
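For cross-checking, and assuming the instance uses hashed storage (where the hash is the SHA-256 of the decimal project ID), the directory name can also be derived from the project ID shown in the Admin Area; the ID below is a placeholder:

```shell
# Hypothetical project ID; substitute the real one from the Admin Area.
PROJECT_ID=1234
HASH=$(printf '%s' "$PROJECT_ID" | sha256sum | awk '{print $1}')
# The backup copies of the repository use the same @hashed layout.
sudo du -sh "/var/opt/gitlab/backups/repositories/@hashed/${HASH:0:2}/${HASH:2:2}/${HASH}"
```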
6 May might be the first backup done after I upgraded that server to 14.10 (I had been on vacation). Combined with the discovery that the increased usage is due to backups, I suspect this is caused by the incremental backup feature (merge request: !3937 (merged)); that sounds like a good thing, but it is not something we've needed yet. What we do use (in fact I implemented it) is `SKIP=tar`. Our backup procedure is based on a systemd timer running `/usr/bin/gitlab-backup create CRON=1 SKIP=tar,registry STRATEGY=copy` (plus a daemon copying `/var/opt/gitlab/backups` to another server afterwards; that server must be seeing the consequences of this too), and GitLab is configured to remove backups older than 23 hours (with `gitlab_rails['backup_keep_time'] = 82800` in `gitlab.rb`).
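For completeness, the moving parts of that procedure condensed into one place (unit names and the exact schedule are omitted; nothing here goes beyond what is described above):

```shell
# What the nightly systemd service effectively executes:
/usr/bin/gitlab-backup create CRON=1 SKIP=tar,registry STRATEGY=copy

# Retention window, set in /etc/gitlab/gitlab.rb (82800 s = 23 hours)
# and applied with `sudo gitlab-ctl reconfigure`:
#   gitlab_rails['backup_keep_time'] = 82800
```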
As the process overwrites old backups anyway, deletion of old backups might be disabled by `SKIP=tar` (it's been over two years since I set that up, so I don't remember all the details anymore). If the deletion of those files was added onto (or is based on) that mechanism, it would explain what I see. In that case: is it safe to delete those files? That might serve as a workaround until compatibility between incremental backups and `SKIP=tar` has been restored.
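As a first step towards that workaround, and only as a sketch under the assumption that the per-day directories really are leftover incremental backup data, one could list the candidates before deleting anything (the window below is deliberately more conservative than the 23-hour `backup_keep_time`):

```shell
# Dry run: list per-day repository backup directories at least two days old.
# Layout assumed from the observations above:
#   .../@hashed/<2 chars>/<2 chars>/<full hash>/<per-day directory>
# Nothing is deleted here.
find /var/opt/gitlab/backups/repositories/@hashed \
    -mindepth 4 -maxdepth 4 -type d -mtime +1 -print

# Only after the Gitaly team confirms these files are safe to remove:
# find /var/opt/gitlab/backups/repositories/@hashed \
#     -mindepth 4 -maxdepth 4 -type d -mtime +1 -exec rm -rf {} +
```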
We also (for security reasons) have a second GitLab server, with much less data, that shows a similar behaviour (but it has a lot more disk, so it won't run out as soon).
What specifically do you need from the Gitaly team
- A fix to the code so that disk usage doesn't continue to grow.
- If the problem is the compatibility issue described above: a clear answer on whether the suggested workaround will work.
Author Checklist
- Customer information provided
- Severity realistically set
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section