Failed to cleanup volumes after upgrade to GitLab Runner 13.4.1
Status update: 2022-01-31
We've just merged !3269 (merged) which should fix or mostly fix this.
We create additional docker volumes in a few different contexts:
- When the `GIT_STRATEGY` is `clone`/`none` (temporary volume)
- When the `GIT_STRATEGY` is `fetch` (cache volume)
- When cache volumes are disabled (temporary volume)
When the temporary volume feature was added, we relied on volumes being removed at the end of each job if they were marked as temporary.
Unfortunately, the volume names used were consistent across multiple jobs. When an error occurs and a temporary volume cannot be removed, two things can happen:
- A volume with the same name is used in future jobs, but because it already exists, it is simply re-used. This is why data is being retained from previous jobs.
- Jobs will repeatedly show an error that the volume cannot be removed. This is because it's still attached to the first container, where volume removal failed.
!3269 (merged) now generates a unique name for each job's temporary volumes. This should mean that:
- Data won't be retained between jobs when caching is disabled and when not using `GIT_STRATEGY=fetch`
- Jobs won't repeatedly show this error when a volume is stuck being removed.
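The core idea of the fix can be sketched in a few lines of shell: derive a fresh random suffix for every job's temporary volume, so a leftover volume from a failed cleanup can never collide with a later job. The `runner-tmp-` prefix and field layout below are illustrative assumptions, not Runner's actual naming scheme.

```shell
# Illustrative only: build a per-job unique temporary-volume name,
# mirroring the idea behind !3269. Name format is an assumption.
job_id=58260                                      # example job ID from the log in this issue
suffix=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')  # 16 random hex chars
tmp_volume="runner-tmp-${job_id}-${suffix}"
echo "$tmp_volume"                                # unique on every run
```

Because the suffix is random per job, a stale volume left behind by a crashed cleanup simply ages out instead of being silently re-used.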
What it doesn't fix are problems where the container and temporary volume were not able to be removed in the first instance. This can happen if Runner is stopped without jobs first being drained or if Docker encounters an error (which it can if disk IO is poor and removal takes too long). If a container does have this problem, it will likely have to be manually removed, but it should no longer cause the problems everybody here has been experiencing.
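When a container does get stuck like this, the manual cleanup is to remove the container first and then its volume. A minimal sketch, using the container ID and volume name from the error log in this issue as placeholders:

```shell
# Placeholders taken from the error message in this issue; substitute
# the IDs from your own logs before running this on the runner host.
container=cb41210023c491c8a1777cba34358faa95de87d2d63993493641c052700c7013
volume=runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

if command -v docker >/dev/null 2>&1; then
  docker rm --force "$container"   # remove the stale container holding the volume
  docker volume rm "$volume"       # the volume can then be removed
else
  echo "docker CLI not found; run this on the runner host"
fi
```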
For anybody who has this problem, it's probably worth running `docker volume ls` and seeing which volumes are still attached to stale containers that could not be removed. If you've found a unique case where a volume wasn't able to be removed, please let us know.
If anybody is able to test out the bleeding-edge version of Runner (https://docs.gitlab.com/runner/install/bleeding-edge.html) that includes this fix, I'd really appreciate it.
Summary
After upgrading to GitLab Runner 13.4.1, we randomly get the following error at the end of pipeline jobs:

```
ERROR: Failed to cleanup volumes
```
It seems that the volume cleanup was added within commit 469a185f.
Relevant logs
The related error from syslog:
```
Sep 30 11:57:18 eu-gr2 gitlab-runner[8416]: ERROR: Failed to cleanup volumes  error=remove temporary volumes: Error response from daemon: remove runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8: volume is in use - [cb41210023c491c8a1777cba34358faa95de87d2d63993493641c052700c7013] (manager.go:220:0s) job=58260 project=66 runner=WMzkDr33
```
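For anyone triaging similar errors: the volume name in that message encodes which runner, project, and concurrency slot created it. A quick decode, with the field layout inferred from the log line above:

```shell
# Split the volume name on "-"; field layout inferred from the log line.
vol=runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
IFS=- read -r _ runner _ project _ slot _ hash <<<"$vol"
echo "runner=$runner project=$project slot=$slot cache-key=$hash"
```

This makes it easy to map a stuck volume back to the runner and project that produced it.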
Environment description
We are using a custom GitLab Runner installation (version 13.4.1) with the Docker-in-Docker (DinD) executor.
`docker info`:

```
Client:
 Debug Mode: false

Server:
 Containers: 63
  Running: 3
  Paused: 0
  Stopped: 60
 Images: 70
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 35bd7a5f69c13e1563af8a93431411cd9ecf5021
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.19.0-6-amd64
 Operating System: Debian GNU/Linux 10 (buster)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.73GiB
 Name: eu-gr2
 ID: TI6D:J2ZY:7MYY:YPZS:P6SW:ZUBT:TKFF:UZ44:3YDD:XOZI:SS27:HFGB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
`config.toml`:

```toml
concurrent = 8
check_interval = 0

[session_server]
  listen_address = "0.0.0.0:33338"
  advertise_address = "XXXXX"
  session_timeout = 1800

[[runners]]
  name = "eu-gr2"
  url = "XXXXX"
  token = "XXXXX"
  executor = "docker"
  environment = ["DOCKER_TLS_CERTDIR="]
  [runners.docker]
    tls_verify = false
    image = "docker:stable"
    privileged = true
    disable_cache = false
    volumes = ["/cache", "/certs/client"]
    environment = ["DOCKER_DRIVER=overlay2"]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
```