Force job to fail early if the runner's starting disk space is too low, and automatically retry
Problem
We periodically encounter master-broken::infrastructurerunner-disk-full causing jobs to fail without any automated recovery options. A recent example is https://gitlab.com/gitlab-org/gitlab/-/jobs/7972763754.
How we have been "solving" this problem up to now
So far, the best resolution recommended to us has been to raise the disk space. See this discussion for the reasoning behind choosing this over a proactive solution.
Other automation we have introduced to help us troubleshoot
We recently added disk utilization logging to gitlab-org/gitlab CI jobs before and after script execution. In this most recent example, the logging reveals that the runner started with only 1.7G of available disk, while 97% of its total disk was occupied.
New proposal
- Force a selection of jobs to fail immediately (all rspec jobs, compile-production-assets, etc.) if the df script returns less than 2GB of available disk space.
  - We can start with MR pipelines first.
  - For the master pipelines, we can update the incident auto-triaging automation to recognize this early failure by its exit code, and label it with master-broken::infrastructurerunner-disk-full without actually seeing the runner disk full error message.
Implementation thought: We can run a script that processes the output of the df script and forces the job to fail with a specified exit code if the available disk space is too low. With the exit code specified, we may be able to reconfigure our jobs to benefit from this automatic retry CI configuration if a retry gets us a new runner, or instruct team members to manually click retry, similar to what we did with the PG::QueryCanceled error.
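A minimal sketch of such a check, assuming the job can call `df -Pk` on its working directory. The function names, the `MIN_AVAILABLE_KB` variable, and the exit code `111` are hypothetical placeholders, not values agreed on in this issue:

```shell
# Hypothetical values: neither is defined in the proposal above.
MIN_AVAILABLE_KB=$((2 * 1024 * 1024))  # 2GB threshold, expressed in 1K blocks
LOW_DISK_EXIT_CODE=111                 # arbitrary code meaning "runner disk too low"

# Read `df -Pk <path>` output on stdin and print the "Available" column (in KB).
# With -P, df prints one header line followed by one line per filesystem.
available_kb_from_df() {
  awk 'NR == 2 { print $4 }'
}

# Return LOW_DISK_EXIT_CODE if the given number of available KB is below the threshold.
check_available_kb() {
  if [ "$1" -lt "$MIN_AVAILABLE_KB" ]; then
    echo "Only ${1}KB of disk available; need at least ${MIN_AVAILABLE_KB}KB." >&2
    return "$LOW_DISK_EXIT_CODE"
  fi
  return 0
}

# Possible use at the start of a job (placement is an assumption):
#   check_available_kb "$(df -Pk . | available_kb_from_df)" || exit $?
```

Keeping the parsing and the threshold comparison in separate functions lets the comparison be exercised with recorded `df` output, without needing a runner that is actually low on disk. The dedicated exit code is what the retry configuration or the auto-triaging automation would key on.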
Note: The 2GB value can be adjusted. I roughly estimated that a job requires at least 2GB of disk space to run by looking at some of the recent failed RSpec jobs. For example, with this job, used disk was 23G at the beginning of the job and 25G at the end, suggesting the job used at least 2GB.
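The estimate in the note can be reproduced from the before and after readings. The sketch below assumes the "Used" values are available in KB (for example from `df -Pk`); the helper name `used_delta_kb` is hypothetical:

```shell
# Print how many KB of disk were consumed between two "Used" readings (both in KB).
used_delta_kb() {
  echo $(( $2 - $1 ))
}

# Readings from the note, converted from G to KB: 23G before, 25G after.
used_delta_kb $((23 * 1024 * 1024)) $((25 * 1024 * 1024))
```

Because the logged figures are rounded, human-readable values, the true difference may be somewhat above or below 2GB, which is another reason to treat the threshold as adjustable.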