We have a Docker executor running so that different projects using different technologies can run CI/CD. We have had to write a set of small clean-up routines to make sure the system doesn't run out of resources, because gitlab-runner does not remove the containers that it uses, and if your environment is busy this can add up very quickly.
Proposal
GitLab is by design not automatically clearing this cache because we use it to speed up pipeline execution. That said, this cache can grow large, so there is actually already a script bundled with the runner called /usr/share/gitlab-runner/clear-docker-cache which will handle this for you.
We should make it clear in the Runner documentation that this is the case, and a cron job (or custom scheduled job associated with disk space usage limits) can be created to run this script.
Note that this issue was originally reported as a ~bug; please see below for the case where the right approach was unclear/not easy to discover.
Steps to reproduce
Run the shared runner with the following configuration, register the runner with GitLab as shared, and run a job that uses it. A minute after the job is finished you'll see the container in the "Exited" state, and it remains there.
Yes, this is a serious problem that filled up all the disk space on our private runner. I had to add the infamous cronjob to clean up, but that does not feel right :-(
A simple docker system prune -a -f on a weekly cronjob is what I'm using now (see docs). I've got a single gitlab-runner server hosting all our dockerized gitlab runners and was running into the same issue. This works in a pinch assuming you don't leave unused images/containers/networks/volumes lying around that you intend to re-use. Easy enough when you have a dedicated build server. That said though, I'd very much like a better option as well, precisely how I found this issue.
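For reference, a weekly crontab entry for this looks roughly like the following (the schedule is arbitrary; -a also removes unused images, -f skips the confirmation prompt):

```
# Sundays at 03:00 -- prune unused containers, networks, images, and build cache
0 3 * * 0 /usr/bin/docker system prune -a -f > /dev/null 2>&1
```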
@ayufan and @erushton this seems to be incorrect behavior - one of the links indicates there may be code to clean this up that is not working. Can you please take a look and determine LOE to resolve this?
@jlenny @kristmcg By design, we preserve some number of containers/volumes for cache purposes; they are named runner-ID-cache-HASH. We use them to speed up subsequent runs on the same machine. This specific behavior of GitLab Runner can be disabled with disable_cache = true in config.toml.
Without knowing what containers are left I assume that these are caches.
@ayufan Can you explain more about "single build"? I'm not quite understanding. How does disable_cache = true have an impact, or not, in a scenario where there are many GitLab Runners across several VMs, and where there might be multiple projects/pipelines but no re-use between each project that gets built?
Experiencing this as well. The machine where my build runner runs has very limited storage. It builds the Qt lib from scratch, which occupies about 3 GB of space per build. That means I can only run my CI pipeline once before I have to manually clean the cached volumes.
First, let's describe how containers are handled by Runner, since this will shed some light on why some containers are left behind.
Runner, when the docker executor is used, creates several containers to handle one job:
the main container, where user's script is executed,
containers for all specified services,
a number of containers with predefined in the name, where git clone/fetch, artifacts, and cache operations are done,
a container with cache in the name, where the build's directory is stored for use in a future job. Thanks to this container one may use the fetch strategy in future jobs; without it, the docker executor would support only the clone strategy and would not preserve files created during previous jobs (which may be useful for the user; this tries to replicate the way a Runner with the shell executor may be used). There may also be other containers with -cache- in the name if the user connects volumes via the volumes = [] setting but doesn't specify a host directory - then a cache container is created to store the content of such a volume.
Of all these containers, all except the *-cache-* ones are removed. Knowing what the cache containers are for, it's obvious why they may not be removed - otherwise they would not give the value that we expect from them (to preserve data between jobs).
It's important to know that these cache containers have nothing to do with the cache defined in .gitlab-ci.yml nor with the remote cache configuration (S3, GCS) from config.toml. Both features can be configured and used independently. These are just cache mechanisms for two separate layers of the system.
In some cases (probably like the one here) a user may decide that they don't want to use the docker cache feature (and I think this is the case that creates the problem of this issue). It may be disabled with disable_cache = true in the [runners.docker] section for a specific Runner. The setting and the difference between the two caching layers available for the docker executor are described at https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-docker-section.
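For reference, a minimal config.toml fragment with this setting might look like the following (the runner name and image are placeholders):

```toml
[[runners]]
  name = "docker-runner"      # placeholder
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"   # placeholder
    disable_cache = true      # don't keep the *-cache-* containers/volumes between jobs
```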
And for example - this is how a list of exited containers looks on my local machine, after I started a Runner and executed four jobs:
```
f1e5b79deef6   Exited (0) 2 minutes ago   runner-c8d11a2a-project-8472090-concurrent-2-cache-15a0ecfd391da7edde863f3b2ea17397
8a746c95a67b   Exited (0) 3 minutes ago   runner-c8d11a2a-project-8472090-concurrent-1-cache-15a0ecfd391da7edde863f3b2ea17397
8f70cc6eaacd   Exited (0) 3 minutes ago   runner-c8d11a2a-project-8472090-concurrent-0-cache-15a0ecfd391da7edde863f3b2ea17397
```
Only cache containers are present. The rest of the containers used for the jobs were removed.
Without knowing what containers are left I assume that these are caches.
I don't see any clear statement that would confirm or deny that this is the case, so I'm assuming that it is. Normally no containers - except *-cache-* ones - should be left and this is how Runner works on my machine and on other machines I'm managing. And for that the disable_cache setting is the way to go (again, this doesn't affect caches defined in .gitlab-ci.yml).
I think that what's mostly expected by all people here would be to make this tool a part of Runner. After refactoring this tool and putting it inside of Runner there are at least two possibilities:
We may start this as a separate goroutine at Runner's startup, but only when the Docker or Docker Machine executor is used. In that case the tool behaves just like it does now, but one doesn't need to install and configure another tool - everything is in one place.
We may make it a one-shot mechanism that is started on each job start. As part of the prepare process we would just execute it to try to reclaim some free space (if the configured used-space thresholds are exceeded).
The first one makes the job start faster, while the second one saves a little CPU power since it's started only when free space may be needed.
But in either of these two cases, it should ensure that:
the maintenance of the cleanup mechanism gets the same level of attention as the Runner's maintenance (since it's now a part of Runner),
in case of changes in Runner, it would be easier to remember to update the cleanup mechanism.
Implementing this inside of Runner may be a little tricky for Docker Machine, since Docker is not available locally there, but it should not be hard to make it work for this executor as well.
So, in summary, I propose to:
Extend our documentation to clearly describe the difference between the cache mechanisms that Runner is using, so the related configuration options will not confuse users.
@tmaczukin Thanks for the long writeup. I also came here, like others, because I was running out of disk space on my GitLab Runner server. We are a paying GitLab EE customer.
This is what really helped:
```
$ docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
4578154e8b7ee9add613c07f94b84c1275b82c7beceb99491925482045d660bc
2e9717177c0a7a0ce802c702da51df5e2ed8b63b89813624bbeef0d649b39685
a4d57233b5074720a3d29b080f8de656b5586e5d44512f918b3a935b71f831cf
...
Total reclaimed space: 96.78GB
```
You speak about containers used for caching in your text - but how about these volumes? Are they also involved with the caching, or what is their purpose? Or is this just a bug, that containers are removed but their volumes are left dangling? (since docker volume prune only removes unused volumes, I get the feeling that the latter is the case.)
For reference, here is a part of my config.toml - as can be seen, I have a cache volume enabled.
@jlenny I dug a bit further into this: the reason my /var usage shrank drastically is that it executed /usr/share/gitlab-runner/clear-docker-cache.
I ran this command manually now; went down from about 85% to 9% usage on my /var volume.
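For anyone who wants to check the effect the same way, a rough sequence is (the /var path is just where my Docker data happens to live; the script path assumes the packaged gitlab-runner install):

```
df -h /var                                        # usage before
sudo /usr/share/gitlab-runner/clear-docker-cache  # remove Runner's Docker cache containers/volumes
df -h /var                                        # usage after
```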
Maybe the solution/workaround in this case would be to purge the Docker cache at a configurable interval?
This is done by design: we preserve cache/build volumes in form of containers/volumes as @tmaczukin described.
The script can be used to clean up these containers. This is not a ~bug, but rather a specification of the system: Runner should not manage these containers. This should be done externally. What we could do is simply propose running this script bundled with Runner periodically (a simple cronjob on the weekend?) to clean up the containers and get rid of stale caches.
@ayufan Thanks for clearing things up. Maybe this can then be solved by adding some notes about it to the documentation. I set it up in /etc/cron.weekly/gitlab-clear-docker-cache, which is very trivial.
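A minimal sketch of such a cron.weekly script (essentially just invoking the bundled cleanup script, whose path is assumed here) would be:

```sh
#!/bin/sh
# Weekly cleanup of Runner's Docker cache containers/volumes.
# Assumes the script shipped with the gitlab-runner package is at this path.
/usr/share/gitlab-runner/clear-docker-cache > /dev/null 2>&1
```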
I'm also one of those people with 0 3 * * * /usr/bin/docker system prune -af in my crontab on several runners, which worked for a while. But since the volume of jobs has gone up, it's something I'd have to do every hour or so, so I went looking for better solutions.
This is a dedicated gitlab-runner machine with 20 GBs of space. Currently completely full.
/usr/share/gitlab-runner/clear-docker-cache doesn't do anything, presumably because gitlab-runner-docker-cleanup is also running.
docker volume prune cleaned up 4.7 GB worth of volumes; there's still 15 GB of something.
When I go into /var/lib/docker, it's the overlay2 directory that's massive, even with the cleanups above. The only thing that cleans that up is a docker system prune -af. So after searching for a more elegant solution, I've decided to 1: resize my machines and 2: change that cron job to run every 30 minutes :).
Happy to learn how to do it better, but this will probably work for a while.
In case this helps anyone else: I run the following every hour, so if the disk in question is more than 70% full then the clean-up runs. Feel free to adjust as needed, but you can run it as often as you like (and your system can handle).
Adjust:
DRIVE should be the drive that either is mounted as /var/lib/docker or contains it
WATERLINE is the highest usage level you want the drive to reach before the cleaner runs.
This is my /etc/cron.hours/cleanUpAfterGitLabRunner.
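In sketch form - the DRIVE and WATERLINE values and the cleanup command itself are the bits to adjust:

```sh
#!/bin/sh
DRIVE=/dev/sda1   # the drive that is mounted as (or contains) /var/lib/docker
WATERLINE=70      # highest usage (%) allowed before the cleanup runs

USAGE=$(df "$DRIVE" --output=pcent | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -gt "$WATERLINE" ]; then
  # Substitute whichever cleanup you prefer, e.g. docker system prune -af
  /usr/share/gitlab-runner/clear-docker-cache > /dev/null 2>&1
fi
```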
So should we update https://docs.gitlab.com/runner/executors/docker.html to mention that clear-docker-cache might need to be run regularly? Is the scope of this limited to just the Docker executor? Would an update there catch everyone?
And what about folks that need to run docker system prune? Is that a problem with the script?
@eread Ideally yes. It can be used for Kubernetes, I imagine, but that gets complex since the runner is deployed in the pod, so it doesn't have privileged access to the Docker daemon - it really depends on the user's setup. I think leaving it under the Docker executor is good.
And what about folks that need to run docker system prune? Is that a problem with the script?
It is not a problem with the script, because the script does not clean build/cache volumes, as specified in #2980 (comment 131281466) & #2980 (comment 106845694). So we can suggest docker system prune for a more aggressive approach, but explain to the user the downsides of doing so.
Below is a script I've just created, from a starting point of @awm's script above. There are a few differences:
Supports thresholds for both free space percentage and free inode percentage (defaults to 70% for each)
Dynamically determines the filesystem containing /var/lib/docker
Prunes only containers, images, and volumes (some may prefer to use a single docker system prune)
Can attempt to initially prune only Docker elements older than a given threshold in days - if that doesn't free enough disk space, we resort to pruning everything
This could be enhanced to only prune volumes whose names match the 'runner-...' naming used by recent versions of GitLab runner.
```sh
#!/usr/bin/env sh
FREE_SPACE_PERCENT_THRESHOLD=${1:-70}   # Disk space % usage threshold
FREE_INODE_PERCENT_THRESHOLD=${2:-70}   # Inode % usage threshold
PRUNE_FILTER_UNTIL_DAYS=${3:-30}        # Initially prune only this many days
USE_PRUNE_FILTER_UNTIL=Y                # Requires Docker CLI 1.14.0+ and API v1.26+
DOCKER_STORAGE_DIR=/var/lib/docker

get_docker_storage_usage() {
  METRIC=$1
  echo $(df $DOCKER_STORAGE_DIR --output=$METRIC | tail -n -1 | grep -P -o "\d+")
}

prune_is_required() {
  [ $(get_docker_storage_usage pcent) -gt $FREE_SPACE_PERCENT_THRESHOLD ] || [ $(get_docker_storage_usage ipcent) -gt $FREE_INODE_PERCENT_THRESHOLD ] && return 0
}

do_docker_prune() {
  DOCKER_PRUNE_FILTER="$([ "$1" != "" ] && echo "--filter \"until=$(($1*24))h\"")"
  docker container prune -f "$DOCKER_PRUNE_FILTER" &> /dev/null
  docker image prune -af "$DOCKER_PRUNE_FILTER" &> /dev/null
  docker volume prune -f "$DOCKER_PRUNE_FILTER" &> /dev/null
}

if prune_is_required; then
  echo "The filesystem containing $DOCKER_STORAGE_DIR exceeded a disk usage threshold; pruning docker elements"
  if [ "$USE_PRUNE_FILTER_UNTIL" = "Y" ]; then
    # Initially prune only items older than the filter in days
    do_docker_prune $PRUNE_FILTER_UNTIL_DAYS
    # If that still hasn't freed enough space, omit the filter
    prune_is_required && do_docker_prune
  else
    # Prune everything
    do_docker_prune
  fi
fi
```
Based on the script by @ashleyghooper (thanks a lot!), I made the following modified version. There were a few problems with the previous implementation:
The $DOCKER_PRUNE_FILTER string should not be quoted when it is expanded. If it is, the Docker process will receive --filter "until as a single parameter, instead of separate parameters.
docker volume prune doesn't support the same kind of filters as the container prune and image prune commands do, so the filtering was removed there altogether.
#!/usr/bin/env sh is incorrect for this script, which uses bashisms like &>.
(The first of these can easily be seen by disabling the &> /dev/null redirection. This means that in essence, the filtering would never work; it would always fall back to the full pruning.)
```sh
#!/usr/bin/env bash
FREE_SPACE_PERCENT_THRESHOLD=${1:-70}   # Disk space % usage threshold
FREE_INODE_PERCENT_THRESHOLD=${2:-70}   # Inode % usage threshold
PRUNE_FILTER_UNTIL_DAYS=${3:-30}        # Initially prune only this many days
USE_PRUNE_FILTER_UNTIL=Y                # Requires Docker CLI 1.14.0+ and API v1.26+
DOCKER_STORAGE_DIR=/var/lib/docker

get_docker_storage_usage() {
  METRIC=$1
  echo $(df $DOCKER_STORAGE_DIR --output=$METRIC | tail -n -1 | grep -P -o "\d+")
}

prune_is_required() {
  [ $(get_docker_storage_usage pcent) -gt $FREE_SPACE_PERCENT_THRESHOLD ] || [ $(get_docker_storage_usage ipcent) -gt $FREE_INODE_PERCENT_THRESHOLD ] && return 0
}

do_docker_prune() {
  DOCKER_PRUNE_FILTER="$([ "$1" != "" ] && echo "--filter until=$(($1*24))h")"
  docker container prune -f $DOCKER_PRUNE_FILTER &> /dev/null
  docker image prune -af $DOCKER_PRUNE_FILTER &> /dev/null
  docker volume prune -f &> /dev/null
}

if prune_is_required; then
  echo "The filesystem containing $DOCKER_STORAGE_DIR exceeded a disk usage threshold; pruning docker elements"
  if [ "$USE_PRUNE_FILTER_UNTIL" = "Y" ]; then
    # Initially prune only items older than the filter in days
    do_docker_prune $PRUNE_FILTER_UNTIL_DAYS
    # If that still hasn't freed enough space, omit the filter
    prune_is_required && do_docker_prune
    echo "Disk usage after cleanup: $(get_docker_storage_usage pcent)%"
  else
    # Prune everything
    do_docker_prune
  fi
fi
```
Thanks @perlun. Good catch on the quoting - I don't exactly remember the specifics of my testing, but I think I wanted to avoid pruning a lot of hard-earned cache :), so I stopped short of actually allowing the docker ... prune invocations, probably prefixing them with echo instead.
I also hadn't realised &> was a bashism; I normally try to avoid making scripts dependent on bash, but for some things it's better just to use it. I guess it'd be possible to revert to using > /dev/null 2>&1, but I think I'll just use your version.
@perlun, for me, in the context of gitlab-runner caches, being able to selectively remove volumes was the most important thing, so how about something like the below function? It could also be used in place of docker prune for other types if desired.
(Unfortunately for us, the version of Docker on our gitlab-runner servers (1.13.1) doesn't seem to include CreatedAt in the docker volume inspect output, so I can't actually use this myself yet).
Of course it'd be nice if docker would just add the missing filters for docker volume prune.
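Something along these lines - the function name here is just illustrative, and it assumes GNU date plus a Docker version that reports CreatedAt in docker volume inspect:

```sh
# Remove Docker volumes older than a given number of days, based on CreatedAt.
prune_volumes_older_than_days() {
  MAX_AGE_DAYS=$1
  NOW_EPOCH=$(date +%s)
  for VOLUME in $(docker volume ls -q); do
    # docker volume inspect returns a flat JSON object, so sed is "good enough" here
    ITEM_CREATED=$(docker volume inspect "$VOLUME" | sed -n 's/.*"CreatedAt": "\([^"]*\)".*/\1/p')
    [ -z "$ITEM_CREATED" ] && continue
    ITEM_EPOCH=$(date -d "$ITEM_CREATED" +%s)
    ITEM_AGE_DAYS=$(( (NOW_EPOCH - ITEM_EPOCH) / 86400 ))
    if [ "$ITEM_AGE_DAYS" -gt "$MAX_AGE_DAYS" ]; then
      docker volume rm "$VOLUME" > /dev/null 2>&1
    fi
  done
}
```

It could be called as, for example, prune_volumes_older_than_days $PRUNE_FILTER_UNTIL_DAYS in place of the docker volume prune line in the script above.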
The ITEM_EPOCH= line could of course be replaced by another subshell within the ITEM_AGE_DAYS= calculation.
NB: For other Docker types like containers and images, which return much larger JSON objects, there is a small risk of the sed expression matching something that's not the creation date of the item itself, but rather another, nested Created or CreatedAt key. This could be avoided by using jq or similar, but trying to keep it simple here. docker volume inspect returns a flat JSON object.
@perlun, for me, in the context of gitlab-runner caches, being able to selectively remove volumes was the most important thing, so how about something like the below function? It could also be used in place of docker prune for other types if desired.
Yeah, something like that could work. In my case, the volumes are not "independent" (they are connected to the containers) so once the container is removed, they should be able to be safely removed for us. And before the container is removed, the docker volume prune won't touch them anyway.
(Unfortunately for us, the version of Docker on our gitlab-runner servers (1.13.1) doesn't seem to include CreatedAt in the docker volume inspect output, so I can't actually use this myself yet).
Sorry to hear that (that's a very old version, isn't it?). We have our CI runners mostly set up as "cattle", managing them with Puppet and Ansible, so we can easily run a fairly recent version of Docker on them.
Besides cleaning up volumes, something the above scripts don't take into account is that many projects use old images, and Docker itself only keeps track of an image's created date, which is often months or years old. Pruning by that date flushed all the utilities and layers we required for fast jobs and hammered our Docker repository every time we flushed the runners. What we found to be the best solution for image management is an LRU eviction cache that watches Docker events, tracks the timestamp every time an image is used, and only does anything once a disk threshold has been passed. This lets you get the most out of your Docker image cache based on how much space you are willing to allocate on your runners.
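A rough shell sketch of that idea, purely illustrative - the state directory, threshold, and polling interval are arbitrary, and a real implementation does considerably more:

```sh
#!/usr/bin/env bash
# Illustrative LRU image eviction: track when images are last used via
# `docker events`, and remove the least recently used ones once disk usage
# of /var/lib/docker passes a threshold. Paths/values here are arbitrary.
STATE_DIR=/var/cache/docker-image-lru
USAGE_THRESHOLD=80
mkdir -p "$STATE_DIR"

record_usage() {
  # Each container start refreshes the "last used" stamp file of its image.
  docker events --filter type=container --filter event=start --format '{{.From}}' |
    while read -r IMAGE; do
      echo "$IMAGE" > "$STATE_DIR/$(echo "$IMAGE" | tr '/:' '__')"
    done
}

evict_if_needed() {
  USAGE=$(df /var/lib/docker --output=pcent | tail -n 1 | tr -dc '0-9')
  [ "$USAGE" -le "$USAGE_THRESHOLD" ] && return 0
  # Oldest stamp first = least recently used image first.
  ls -tr "$STATE_DIR" | while read -r STAMP; do
    docker image rm "$(cat "$STATE_DIR/$STAMP")" > /dev/null 2>&1 && rm -f "$STATE_DIR/$STAMP"
    USAGE=$(df /var/lib/docker --output=pcent | tail -n 1 | tr -dc '0-9')
    [ "$USAGE" -le "$USAGE_THRESHOLD" ] && break
  done
}

record_usage &   # watch events in the background
while true; do
  evict_if_needed
  sleep 300
done
```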
For GitLab: it would be great if LRU and automatic management of disk space were at least an option, or even better the default, for Docker runners. Can this be prioritized? We shouldn't have to manually clean/manage disk space or install third-party packages to do that.
It would probably be best to open a new issue, though, if you want GitLab to possibly action anything, since this one has been met with the 'Closed' hammer...
Having such a feature integrated into the runner process itself does not seem too far-fetched to me. The other project is written in Rust, though, and GitLab Runner is Go, so it's a start-from-scratch effort, but the implementation seems pretty simple overall.
The difficulty is convincing the runner team that it's needed, or something they should maintain, and even then I wouldn't hold my breath on getting it for a year or two...
A lot of people are facing this issue, yet the team doesn't seem to find it possible to implement? It's strange to force users to invent the same band-aids, again and again, thousands of times.
It's been 4 years, but the Runner team still doesn't care about solving this issue. What's the point in publishing new features if old features are riddled with BUGS?
I would not say that this is a GitLab Runner bug. Docker is a separate component that gitlab-runner can use to run your CI jobs. The Docker system fills up with used and unused images and overlays; this can be caused by gitlab-runner or by any other user or software that uses Docker. Cleaning that up from the gitlab-runner software sounds like a feature or improvement.
Add the cron job below to the server where your gitlab-runner is set up; use https://crontab.guru/ to help you define the frequency you need.
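For example, something along these lines (weekly, using the bundled cleanup script; swap in docker system prune -af for a more aggressive clean):

```
# Sundays at 04:00 -- clear Runner's Docker cache containers/volumes
0 4 * * 0 /usr/share/gitlab-runner/clear-docker-cache > /dev/null 2>&1
```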