2023-10-19: Disk Space Utilization gitaly service (main stage) running low
Customer Impact
Current Status
In https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17002+ we started draining new Gitaly servers. For this we use gitalyctl, which we configure to drain specific Gitaly nodes. When gitalyctl schedules a repository move, it doesn't specify a destination storage unless the repository is part of a pool repository (which is used for forks). If the repository is part of a fork network, it is moved to the server that hosts the root repository so that it rejoins the pool repository.
When https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17002+ started, we found a large number of repositories (5TB) in fork networks that were not stored next to their root repository, so gitalyctl moved them there. The root repositories happened to be on nfs-file9[0-2], so those servers received the data even though they didn't have any weight and didn't accept new traffic.
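The destination behaviour described above can be sketched roughly as follows. This is an illustrative model only, not gitalyctl's actual code: the `Repository` type, its fields, the `pick_destination` function, and the `old-storage` name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repository:
    path: str
    storage: str                        # storage currently holding the repository
    root_storage: Optional[str] = None  # storage of the fork network's root, if any


def pick_destination(repo: Repository) -> Optional[str]:
    """Return an explicit destination storage, or None to leave the choice open.

    Mirrors the behaviour described in this report: a repository in a fork
    network is sent to the storage of its root repository. Storage weights
    are not consulted at all, which is how weightless servers ended up
    receiving data.
    """
    if repo.root_storage is not None:
        return repo.root_storage
    return None


# A fork whose root lives on nfs-file90 is pinned to that storage.
fork = Repository("group/fork", storage="old-storage", root_storage="nfs-file90")
standalone = Repository("group/project", storage="old-storage")
```

In this sketch `pick_destination(fork)` returns `"nfs-file90"` regardless of that server's weight, while `pick_destination(standalone)` returns `None`.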
Next Steps
2023-10-19:
- We leave it as it is; growth is minimal, so we won't fill the disk.
- EOC can use glsh to migrate some repositories. 👉 #17004 (comment 1611124179)
2023-10-20:
- @sxuereb: Implement a fix to not migrate to old Gitaly servers or servers that don't have a weight. 👉 woodhouse!324 (merged)
- @sxuereb: Start draining nfs-file9[0-2] via gitalyctl to bring them down to 85%, like they were before. 👉 #17004 (comment 1612662067)
  - Version bump 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!3378 (merged)
  - Configure gitalyctl to drain nfs-file9[0-2] 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!3379 (merged)
  - Disable draining 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!3382 (merged)
- @sxuereb: Increase frequency of shard weight assigner. 👉 #17004 (comment 1612031408)
- @sxuereb: Investigate the forks, to see if they are legit forks. Cancelled due to time constraints.
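As a rough illustration of two of the steps above, the sketch below shows a weight guard on the destination and the arithmetic for draining a disk back to 85%. It is not the code from woodhouse!324: the function names, the `weights` mapping, and the choice to return `None` for a rejected destination are assumptions, and the report does not say what the real fix does with such repositories.

```python
from typing import Mapping, Optional


def safe_destination(root_storage: Optional[str],
                     weights: Mapping[str, int]) -> Optional[str]:
    """Accept the fork root's storage only if it has a positive weight.

    Returns None when there is no root storage or when that storage has
    no weight (hypothetical handling; the real fix may behave differently).
    """
    if root_storage is None:
        return None
    if weights.get(root_storage, 0) <= 0:
        return None
    return root_storage


def bytes_to_drain(used: int, capacity: int, target_percent: int = 85) -> int:
    """Amount of data to move off a disk to reach target_percent utilization."""
    target_used = capacity * target_percent // 100
    return max(0, used - target_used)


# Hypothetical weights: nfs-file90 has none, so it is rejected as a destination.
weights = {"nfs-file90": 0, "gitaly-new-01": 100}
```

With these inputs, `safe_destination("nfs-file90", weights)` returns `None`, and a disk with 900 units used out of 1000 needs `bytes_to_drain(900, 1000)`, that is 50 units, moved off to get back to 85%.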
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.