GitLab Runner with docker-autoscaler not reusing available cache volumes
Summary
I recently migrated our self-hosted runner executor from docker+machine to docker-autoscaler, because the former will be EOL at the end of 2024.
Since then we have been running into a lot of "no space left on device" errors.
Upon closer inspection I found that the 100GB disk on the individual VMs is running full because of a large number of Docker volumes. This does not happen after a few weeks or months, but after a few hours. Increasing the disk space is not an option: even 150GB fills up very quickly, and frankly it seems like a waste of money considering that the total disk space a single job could ever need is about 25GB (repository + various caches). A size listing of the cache volumes on one of the VMs looks like this:
3.9G runner-<runner-id>-project-<project-id>-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
3.3G runner-<runner-id>-project-<project-id>-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
73M runner-<runner-id>-project-<project-id>-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
3.3G runner-<runner-id>-project-<project-id>-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.6G runner-<runner-id>-project-<project-id>-concurrent-10-cache-3c3f060a0374fc8bc39395164f415a70
4.2G runner-<runner-id>-project-<project-id>-concurrent-10-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G runner-<runner-id>-project-<project-id>-concurrent-11-cache-3c3f060a0374fc8bc39395164f415a70
3.5G runner-<runner-id>-project-<project-id>-concurrent-11-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G runner-<runner-id>-project-<project-id>-concurrent-12-cache-3c3f060a0374fc8bc39395164f415a70
5.4G runner-<runner-id>-project-<project-id>-concurrent-12-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.0G runner-<runner-id>-project-<project-id>-concurrent-13-cache-3c3f060a0374fc8bc39395164f415a70
4.9G runner-<runner-id>-project-<project-id>-concurrent-13-cache-c33bcaa1fd2c77edfc3893b41966cea8
4.4G runner-<runner-id>-project-<project-id>-concurrent-7-cache-3c3f060a0374fc8bc39395164f415a70
3.4G runner-<runner-id>-project-<project-id>-concurrent-7-cache-c33bcaa1fd2c77edfc3893b41966cea8
1.4G runner-<runner-id>-project-<project-id>-concurrent-8-cache-3c3f060a0374fc8bc39395164f415a70
5.4G runner-<runner-id>-project-<project-id>-concurrent-8-cache-c33bcaa1fd2c77edfc3893b41966cea8
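For reference, a listing like the one above can be produced on an instance with something along these lines (a sketch, assuming Docker's default data root /var/lib/docker):

# Per-volume disk usage of the runner cache volumes on the instance
sudo du -sh /var/lib/docker/volumes/runner-* | sort -h

# Alternatively, let Docker report volume sizes itself
docker system df -v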
I know there are already some issues discussing this.
Apparently the GitLab Runner is, at the moment, not cleaning up those volumes by design; however, my question is rather why the cache volumes are recreated at all.
This is not happening with docker+machine, only with the docker-autoscaler executor. The config is more or less the same; in particular, nothing concerning the cache was changed. With docker+machine, the "concurrent-id" is always 0.
The volumes contain the same things: the repository and some caches from Gradle, pnpm and the like. Why is the cache suddenly not reused for subsequent jobs? Probably because the concurrent-id is incremented, even though there is only ever one job running on any given machine (capacity_per_instance is 1).
The project-id is also always the same; only the concurrent-id and the hash at the end change.
Actual behavior
Sequential jobs (capacity_per_instance=1) for the same project on the same machine don't always reuse the cache volumes from the previous job (although they should), but instead get new ones of their own, which contain essentially the same data.
The sizes only differ because different jobs may download more dependencies (e.g. for Gradle), but the basis is the same.
As a result, a lot of unused Docker volumes pile up and quickly fill the available disk space. The existing workarounds for this problem are generally not good enough for the short time frame in which this happens.
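(For context, the kind of workaround meant here is a periodic manual cleanup on each instance, roughly like the sketch below; the name filter is only an assumption based on the volume names above, and running it of course throws away exactly the caches that should have been reused.)

# Remove runner cache volumes that are not attached to a container.
# Volumes still in use by a running job cannot be removed and are skipped.
docker volume ls -q --filter "name=runner-" \
  | xargs -r docker volume rm 2>/dev/null || true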
Expected behavior
Caches should be reused whenever possible.
Configuration
We have no additional configuration regarding the cache in any .gitlab-ci.yml file.
Important parts of the runner config:
[[runners]]
  name = "docker-autoscaler-1"
  url = "https://gitlab.com/"
  token = "xxxxxx"
  executor = "docker-autoscaler"
  limit = 240          # Job limit
  output_limit = 30000 # Maximum log size
  # Directories
  cache_dir = "/cache"
  builds_dir = "/builds"

  [runners.docker]
    image = "ubuntu:24.04"
    pull_policy = ["always"]
    tls_verify = false
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    shm_size = 2000000000 # 2GB
    volumes = [
      "/var/run/docker.sock:/var/run/docker.sock",
      "/cache",
      "/builds",
    ]

  [runners.autoscaler]
    # Manually installed in the Dockerfile
    plugin = "fleeting-plugin-googlecloud"
    max_instances = 240       # Maximum number of instances
    capacity_per_instance = 1 # How many jobs in parallel on a single VM
    delete_instances_on_shutdown = false

  [runners.cache]
    Type = "gcs"
    Path = "runner-cache"
    Shared = true # Share between runners

    [runners.cache.gcs]
      CredentialsFile = "..."
      BucketName = "..."
Used GitLab Runner version
- GitLab Runner: 17.2.0 (with docker-autoscaler executor)
- fleeting-plugin-googlecloud: 1.0.0
@dnsmichi (https://forum.gitlab.com/t/gitlab-runner-with-docker-autoscaler-not-reusing-available-cache-volumes/108382/3)