Investigate Disk Full Errors on our Deploy nodes
Problem Description
Many of our deploy nodes began to run low on disk space this week. After investigation, the nodes are still being cleaned up properly. Let's take a look at one target deployment that failed: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/1825310
- The warmup job runs a step that clears out the downloads: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/9628424#L121
- The migrations job failed to upgrade the deployer node due to being out of disk space: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/9628428#L193
If we hop onto the node, sure enough, we only have that single package downloaded:
root@deploy-cny-01-sv-gprd.c.gitlab-production.internal:~# ls -lah /var/cache/deploy-tooling/deb/
total 1.4G
drwxr-xr-x 2 root root 4.0K Mar 29 14:47 .
drwxr-xr-x 3 root root 4.0K Dec 22 16:25 ..
-rw-r--r-- 1 root root 1.4G Mar 29 14:47 gitlab-ee_15.11.202303291305-96f6b28da7b.0722eed4879_amd64.deb
And we are low but not out of space:
root@deploy-cny-01-sv-gprd.c.gitlab-production.internal:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 3.6G 0 3.6G 0% /dev
tmpfs 746M 75M 671M 10% /run
/dev/sda1 20G 12G 7.4G 62% /
...
Currently this is a headache: we occasionally alert the on-call about Disk Saturation, but we cross this threshold DURING a deployment. By that point we are normally already asking for help, since a deploy job will have failed.
Current Remediation Options
- Clean out the deb download directory path
/var/cache/deploy-tooling/deb/*
- Run an apt-get clean operation
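The two remediation options above could be scripted roughly as follows. This is a minimal sketch, not what the deployer or Ansible actually runs: the function name clean_deb_cache is hypothetical, and the default path is taken from the directory listing earlier in this issue.

```shell
#!/bin/sh
# Hypothetical helper: delete downloaded .deb files from a cache directory.
# The default directory is the one shown in the listing above; pass a
# different path as the first argument to override it.
clean_deb_cache() {
    dir="${1:-/var/cache/deploy-tooling/deb}"
    # Nothing to do if the directory does not exist.
    [ -d "$dir" ] || return 0
    # Only remove regular *.deb files directly inside the directory.
    find "$dir" -maxdepth 1 -type f -name '*.deb' -delete
}

# Usage on a deploy node (as root), followed by clearing apt's own cache:
#   clean_deb_cache && apt-get clean
```

Restricting the delete to *.deb files at the top level keeps the helper from touching anything else that might end up in that directory.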
Milestones
- Investigate if our 1.4G download expands beyond 7.4G (highly unlikely)
- Investigate to validate that Ansible is cleaning up as desired
- Determine how to resolve our disk space issues
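For the first milestone, one way to check whether the 1.4G package could expand past the 7.4G available is to compare the package's declared Installed-Size with the free space on /. The sketch below assumes the package sets the Installed-Size control field (an estimate in KiB, and it does not cover temporary space used while unpacking); the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing a package's declared installed size
# against the space available on a filesystem. All sizes are in KiB.

# Print the Installed-Size control field of a .deb (an estimate, in KiB).
deb_installed_kib() {
    dpkg-deb -f "$1" Installed-Size
}

# Print the available space (KiB) on the filesystem containing a path.
avail_kib() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed if the required size is strictly less than the available size.
has_room() {
    [ "$1" -lt "$2" ]
}

# Usage on a deploy node, with $deb set to the downloaded package path:
#   has_room "$(deb_installed_kib "$deb")" "$(avail_kib /)" \
#       && echo "fits" || echo "does not fit"
```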
Edited by John Skarbek