2022-12-06: gstg-gitaly failing due to 'no space left on device'
Current Status
At the moment we have 537G of free space, and it is no longer decreasing, so the incident is resolved.
Storage usage is also stable and no longer growing.
What happened
On 2022-11-23 we started seeing disk space declining at a steady rate until it reached 0B.
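A steady decline like this can be extrapolated to estimate when the disk will fill. The following is a minimal sketch, assuming free-space readings are collected as (timestamp, free_bytes) pairs; the function name and the sample values are hypothetical and are not measurements from this incident.

```python
from datetime import datetime, timedelta


def estimate_time_to_full(samples):
    """Estimate when free space reaches zero from (timestamp, free_bytes) samples.

    Draws a straight line through the first and last sample. Returns None
    if free space is not decreasing over that interval.
    """
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or free1 >= free0:
        return None
    rate = (free0 - free1) / elapsed  # bytes consumed per second
    return t1 + timedelta(seconds=free1 / rate)


# Hypothetical readings for illustration only.
samples = [
    (datetime(2022, 11, 23, 0, 0), 400 * 1024**3),
    (datetime(2022, 11, 24, 0, 0), 300 * 1024**3),
]
print(estimate_time_to_full(samples))
```

On a live host, the free_bytes values could be taken from shutil.disk_usage(path).free at regular intervals.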
At first we tried to move a bunch of projects off file-01 with https://ops.gitlab.net/gitlab-com/gl-infra/balancer. However, the disk space recovered for a little while and then reached 0B again, so it was clear that something else was filling up the disk. Running iotop showed that gitaly was doing the most writes, so the data was being generated by the GitLab application and not by a runaway process.
After that we ran dua to analyze the /var/opt/gitlab disk and saw where most of the space was being used. We later found some projects that were 250GB in size. We ran manual housekeeping on them and managed to reduce the disk space usage of those repositories by 50%.
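The kind of per-directory breakdown that dua gives can be approximated with a short script. This is a rough sketch and not how dua itself works; the function names are hypothetical, and it sums apparent file sizes (st_size) rather than allocated disk blocks, so the numbers can differ from what dua or du report.

```python
import os


def dir_size(path):
    """Return the total apparent size in bytes of files under path.

    Symlinked directories are not followed; unreadable entries are skipped.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable
    return total


def largest_children(parent, top=10):
    """Return (size, path) pairs for the biggest immediate subdirectories."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Scans the current directory; point it at the disk under investigation.
    for size, path in largest_children("."):
        print(f"{size / 1024**3:8.2f} GiB  {path}")
```

Running largest_children repeatedly on the largest result narrows the search down to the individual repositories that hold most of the data.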
We have load generation that runs in scheduled CI/CD pipelines and sends requests to gitlab/gitlab-org and a bunch of forks of these projects. Some of these RPCs, such as FetchSourceBranch, create a tmp packfile for every request. Some of these requests were timing out, and their packfiles were never cleaned up. These leftover packfiles were the root cause, which is being fixed in gitlab-org/gitaly#4520 (closed).
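Leftover temporary files of this kind can be spotted by looking for old entries in a repository's pack directory. The sketch below rests on two assumptions that this report does not confirm: that the repository is bare, so objects/pack sits directly under the repository path, and that the leftovers use the tmp_ prefix that Git conventionally gives its temporary files. The function name is hypothetical. It only reports candidates and deletes nothing.

```python
import os
import time


def find_stale_tmp_files(repo_path, max_age_seconds=24 * 3600, now=None):
    """List (path, size) of tmp_* files in objects/pack older than max_age_seconds.

    Assumes a bare repository layout and the tmp_ naming convention.
    """
    now = time.time() if now is None else now
    pack_dir = os.path.join(repo_path, "objects", "pack")
    stale = []
    try:
        with os.scandir(pack_dir) as it:
            entries = list(it)
    except FileNotFoundError:
        return stale  # no pack directory, nothing to report
    for entry in entries:
        if not entry.name.startswith("tmp_"):
            continue
        if not entry.is_file(follow_symlinks=False):
            continue
        info = entry.stat(follow_symlinks=False)
        if now - info.st_mtime > max_age_seconds:
            stale.append((entry.path, info.st_size))
    return stale
```

The age threshold matters: a temporary file written by a request that is still in flight looks the same as an abandoned one, so only files well past any plausible request timeout should be treated as leftovers.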
We've manually run housekeeping for some of these large repositories (#8134 (comment 1199746430), #8134 (comment 1199758995), #8134 (comment 1199765292), and #8134 (comment 1202897890)). Running housekeeping cleans up the unused references and also allows FetchSourceBranch to succeed. In gitlab-org/gitaly#3951 (closed) we plan to have housekeeping run not just on git-push(1) but also on other types of RPCs.
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | Gitlab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks and/or Hot Patching | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.