Incident Review: Disk space above 90% for file-81
Incident Review
The on-call got paged because disk space for the Gitaly service was violating our SLO. On further investigation, we saw that file-81 was the node above 90% disk usage (our SLO threshold).
Usually, we stop sending new projects to a Gitaly node when it reaches 80%, so the only thing that should cause such an increase on file-81 is requests on existing projects that push new files or create new pack-objects. We did see a spike in requests, which could have contributed, but it didn't explain such a large, sustained increase. We then noticed that the weight on file-81 was 100, which means it accepts new projects and Rails favors this node. That is not what we want for a node above the 80% threshold.
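The threshold rule described above can be sketched roughly as follows. This is a simplified illustration, not the weight assigner's actual code: the function name `assign_weight` and the two-value scheme (0 or 100) are assumptions, and the real job may scale weights more gradually.

```python
# Hypothetical sketch of the 80% rule; names and the 0/100 scheme are assumptions.
FULL_WEIGHT = 100        # node accepts new projects and is favored by Rails
STOP_THRESHOLD = 0.80    # stop sending new projects at 80% disk usage


def assign_weight(used_fraction: float) -> int:
    """Return 0 when the disk is at or above the threshold, else the full weight."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0.0 and 1.0")
    return 0 if used_fraction >= STOP_THRESHOLD else FULL_WEIGHT
```

Under this sketch, a node reported at 90% usage gets weight 0, so a weight of 100 on file-81 implies the assigner was seeing a much lower usage figure than the real one.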
After we set the weight to 0 manually and re-ran the weight assigner, it set the weight back to 100, which clearly showed that something was broken in the weight assigner.
Our main goal was to prevent disk saturation, so we paused the weight-assigner job for the time being and set the weight to 0 manually, which let us focus on evicting some projects from that Gitaly node. We used glsh gitaly repository move rather than the balancer project because we wanted more control over which repositories we moved: file-81 hosts some large repositories, for example one of 280GB (we are investigating why we allow such a large repository in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17440). We ended up moving around 500GB in total. It was a useful exercise because we learned how long it takes to transfer a 120GB repository within the same zone: around 30 minutes. Impressively, it succeeded without any retries!
Now that we had free disk space again, we could focus on why the weight assigner was setting a weight of 100 on file-81. The weight assigner runs a query in Thanos that checks the used space on device /dev/sdb, which seemed fine. However, we later noticed that the data disk on file-81 is named /dev/sdc, unlike the other Gitaly nodes, where it is /dev/sdb. On file-81, /dev/sdb was the log disk, which was only 12% full, so from the weight assigner's point of view the node looked fine. We fixed this by selecting on mountpoint=/var/opt/gitlab instead, since the device name can vary but we always mount the data disk at the same path.
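To illustrate the difference between the two selectors, here is a minimal sketch of how such a disk-usage query could be built. The weight assigner's real query is not shown in this review; the metric names below are the standard node_exporter filesystem metrics and are an assumption, as is the helper name `disk_used_ratio_query`.

```python
# Hypothetical helper; the actual weight assigner query may differ.
def disk_used_ratio_query(label: str, value: str) -> str:
    """Build a PromQL expression for the used fraction of a filesystem.

    `label` is the label to select on, e.g. "device" or "mountpoint".
    """
    selector = f'{label}="{value}"'
    return (
        f"1 - (node_filesystem_avail_bytes{{{selector}}}"
        f" / node_filesystem_size_bytes{{{selector}}})"
    )


# Selecting by device: matches whatever disk happens to be /dev/sdb on each node.
by_device = disk_used_ratio_query("device", "/dev/sdb")

# Selecting by mount path: matches the Gitaly data disk regardless of device name.
by_mountpoint = disk_used_ratio_query("mountpoint", "/var/opt/gitlab")
```

Selecting on the mount path ties the measurement to where the data lives rather than to the order in which disks were attached.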
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics