GitLab Runner with docker-autoscaler not reusing available cache volumes

Summary

I recently migrated our self-hosted runner executor from docker+machine to docker-autoscaler, because the former will reach end of life at the end of 2024.

Suddenly we were running into a lot of "No space left on device" errors.

Upon closer inspection I found that the 100GB disk on each VM fills up because a large number of Docker volumes accumulate. This happens not after a few weeks or months, but after a few hours. Increasing the disk space is not an option: even 150GB fills up very quickly, and frankly it seems like a waste of money considering that the total amount of disk space any single job could ever need is 25GB (repository + various caches):

3.9G	runner-<runner-id>-project-<project-id>-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
3.3G	runner-<runner-id>-project-<project-id>-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
73M	runner-<runner-id>-project-<project-id>-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
3.3G	runner-<runner-id>-project-<project-id>-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.6G	runner-<runner-id>-project-<project-id>-concurrent-10-cache-3c3f060a0374fc8bc39395164f415a70
4.2G	runner-<runner-id>-project-<project-id>-concurrent-10-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G	runner-<runner-id>-project-<project-id>-concurrent-11-cache-3c3f060a0374fc8bc39395164f415a70
3.5G	runner-<runner-id>-project-<project-id>-concurrent-11-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G	runner-<runner-id>-project-<project-id>-concurrent-12-cache-3c3f060a0374fc8bc39395164f415a70
5.4G	runner-<runner-id>-project-<project-id>-concurrent-12-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.0G	runner-<runner-id>-project-<project-id>-concurrent-13-cache-3c3f060a0374fc8bc39395164f415a70
4.9G	runner-<runner-id>-project-<project-id>-concurrent-13-cache-c33bcaa1fd2c77edfc3893b41966cea8
4.4G	runner-<runner-id>-project-<project-id>-concurrent-7-cache-3c3f060a0374fc8bc39395164f415a70
3.4G	runner-<runner-id>-project-<project-id>-concurrent-7-cache-c33bcaa1fd2c77edfc3893b41966cea8
1.4G	runner-<runner-id>-project-<project-id>-concurrent-8-cache-3c3f060a0374fc8bc39395164f415a70
5.4G	runner-<runner-id>-project-<project-id>-concurrent-8-cache-c33bcaa1fd2c77edfc3893b41966cea8
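For reference, a per-volume listing like the one above can be produced on a VM with something along these lines. The Docker data root below is the default and is overridable; adjust it if your image uses a custom data root.

```shell
# Sketch: summarize disk usage per runner cache volume on the VM.
# VOLUME_ROOT defaults to Docker's standard data root; override as needed.
VOLUME_ROOT="${VOLUME_ROOT:-/var/lib/docker/volumes}"
du -sh "$VOLUME_ROOT"/runner-* 2>/dev/null | sort -k2
```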

I know there are already some issues discussing this.

Apparently, at the moment the GitLab Runner does not clean up those volumes by design; my question, however, is rather why the cache volumes are recreated at all.

This does not happen with docker+machine, only with the docker-autoscaler executor. The configuration is more or less the same; in particular, nothing concerning the cache was changed. With docker+machine the "concurrent-id" is always 0.
The volumes contain the same things: the repository and caches from Gradle, pnpm and the like. Why is the cache suddenly not reused for subsequent jobs? Probably because the concurrent-id is incremented, even though only one job ever runs on any given machine (capacity_per_instance is 1).
The project-id is also always the same; only the concurrent-id and the hash at the end change.
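The naming scheme can be sketched as follows. The 32-hex-character suffix is consistent with an MD5 digest of the cached directory's container path, which would explain why the two recurring hashes map to the two volume paths ("/cache" and "/builds") in the config below; the exact hash input is an assumption on my part, not confirmed from the runner source.

```shell
# Sketch of the observed volume naming scheme (assumption: the trailing hash
# is an MD5 digest of the container path of the cached directory).
cache_volume_name() {
  # $1 = runner short token, $2 = project id, $3 = concurrent id, $4 = container path
  digest=$(printf '%s' "$4" | md5sum | cut -d' ' -f1)
  printf 'runner-%s-project-%s-concurrent-%s-cache-%s\n' "$1" "$2" "$3" "$digest"
}

# The hash depends only on the path, so the duplicated volumes differ solely
# in the concurrent id:
cache_volume_name abc123 42 0 /cache
cache_volume_name abc123 42 1 /cache
```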

Actual behavior

Sequential jobs (capacity_per_instance=1) for the same project on the same machine don't always reuse the cache volumes from the previous job (although they should), but get fresh ones containing essentially the same data.
The sizes differ only because different jobs may download additional dependencies for e.g. Gradle, but the base content is the same.
As a result, a lot of unused Docker volumes pile up and quickly fill the available disk space. The current general workarounds for this problem are not good enough for the short time-frame in which this happens.
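Since the runner currently leaves these volumes behind, a stop-gap could be to periodically (e.g. from cron on each VM) delete runner cache volumes that no container references anymore. This is only a sketch, not an official cleanup mechanism; the name filter is based on the naming scheme shown above.

```shell
# Hypothetical stop-gap cleanup: remove runner cache volumes that are not
# attached to any container (running or stopped).
cleanup_runner_volumes() {
  # Do nothing on hosts without docker, e.g. when testing the script itself.
  command -v docker >/dev/null 2>&1 || { echo "docker not available"; return 0; }
  docker volume ls --quiet --filter 'name=runner-' | while read -r vol; do
    # Only remove the volume if no container still uses it.
    if [ -z "$(docker ps --all --quiet --filter "volume=$vol")" ]; then
      docker volume rm "$vol"
    fi
  done
}

cleanup_runner_volumes
```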

Expected behavior

Caches should be reused whenever possible.

Configuration

We have no additional cache configuration in any .gitlab-ci.yml file.

Important parts of the runner config

[[runners]]
    name = "docker-autoscaler-1"
    url  = "https://gitlab.com/"

    token    = "xxxxxx"
    executor = "docker-autoscaler"

    limit        = 240   # Job limit
    output_limit = 30000 # Maximum log size

    # Directories
    cache_dir  = "/cache"
    builds_dir = "/builds"

    [runners.docker]
        image        = "ubuntu:24.04"
        pull_policy = ["always"]

        tls_verify                   = false
        privileged                   = false
        disable_entrypoint_overwrite = false
        oom_kill_disable             = false
        disable_cache                = false
        shm_size                     = 2000000000 # 2GB

        volumes = [
            "/var/run/docker.sock:/var/run/docker.sock",
            "/cache",
            "/builds",
        ]

    [runners.autoscaler]
        # Manually installed in the Dockerfile
        plugin = "fleeting-plugin-googlecloud"

        max_instances                = 240   # Maximum number of instances
        capacity_per_instance        = 1     # How many jobs in parallel on a single VM
        delete_instances_on_shutdown = false

    [runners.cache]
        Type   = "gcs"
        Path   = "runner-cache"
        Shared = true           # Share between runners

        [runners.cache.gcs]
            CredentialsFile = "..."
            BucketName      = "..."

Used GitLab Runner version

  • GitLab Runner: 17.2.0 (with docker-autoscaler executor)
  • fleeting-plugin-googlecloud: 1.0.0

@dnsmichi (https://forum.gitlab.com/t/gitlab-runner-with-docker-autoscaler-not-reusing-available-cache-volumes/108382/3)