Failed to cleanup volumes after upgrade to GitLab Runner 13.4.1
Status update: 2022-01-31
We've just merged !3269 (merged) which should fix or mostly fix this.
We create additional docker volumes in a few different contexts:
- When the `GIT_STRATEGY` is `clone`/`none` (temporary volume)
- When the `GIT_STRATEGY` is `fetch` (cache volume)
- When cache volumes are disabled (temporary volume)
When the temporary volume feature was added, we relied on volumes being removed at the end of each job if they were marked as temporary.
Unfortunately, the volume names used were consistent across multiple jobs. When an error occurs and a temporary volume cannot be removed, two things can happen:
- A volume with the same name is used in future jobs, but because it already exists, it is simply re-used. This is why data is being retained from previous jobs.
- Jobs will repeatedly show an error that the volume cannot be removed. This is because it's still attached to the first container, where volume removal failed.
!3269 (merged) now generates a unique name for each job's temporary volumes. This should mean that:
- Data won't be retained between jobs when caching is disabled and when not using `GIT_STRATEGY=fetch`
- Jobs won't repeatedly show this error when a volume is stuck being removed.
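The core idea of the fix can be sketched in a few lines of shell: derive a fresh random suffix for every job's temporary volume, so a leftover volume from a failed cleanup can never collide with a later job. The `runner-tmp-` prefix and field layout below are illustrative assumptions, not Runner's actual naming scheme.

```shell
# Illustrative only: build a per-job unique temporary-volume name,
# mirroring the idea behind !3269. Name format is an assumption.
job_id=58260                                      # example job ID from the log in this issue
suffix=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')  # 16 random hex chars
tmp_volume="runner-tmp-${job_id}-${suffix}"
echo "$tmp_volume"                                # unique on every run
```

Because the suffix is random per job, a stale volume left behind by a crashed cleanup simply ages out instead of being silently re-used.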
What it doesn't fix are problems where the container and temporary volume were not able to be removed in the first instance. This can happen if Runner is stopped without jobs first being drained or if Docker encounters an error (which it can if disk IO is poor and removal takes too long). If a container does have this problem, it will likely have to be manually removed, but it should no longer cause the problems everybody here has been experiencing.
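When a container does get stuck like this, the manual cleanup is to remove the container first and then its volume. A minimal sketch, using the container ID and volume name from the error log in this issue as placeholders:

```shell
# Placeholders taken from the error message in this issue; substitute
# the IDs from your own logs before running this on the runner host.
container=cb41210023c491c8a1777cba34358faa95de87d2d63993493641c052700c7013
volume=runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

if command -v docker >/dev/null 2>&1; then
  docker rm --force "$container"   # remove the stale container holding the volume
  docker volume rm "$volume"       # the volume can then be removed
else
  echo "docker CLI not found; run this on the runner host"
fi
```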
For anybody who has this problem, it's probably worth running `docker volume ls` and seeing which volumes are still attached to stale containers that could not be removed. If you've found a unique case where a volume wasn't able to be removed, please let us know.
If anybody is able to test out the bleeding-edge version of Runner (https://docs.gitlab.com/runner/install/bleeding-edge.html) that includes this fix, I'd really appreciate it.
Summary
After upgrading to GitLab Runner 13.4.1, we randomly get the following error at the end of pipeline jobs:

```
ERROR: Failed to cleanup volumes
```
It seems that the volume cleanup was added within commit 469a185f.
Relevant logs
The related error from syslog:
```
Sep 30 11:57:18 eu-gr2 gitlab-runner[8416]: ERROR: Failed to cleanup volumes  error=remove temporary volumes: Error response from daemon: remove runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8: volume is in use - [cb41210023c491c8a1777cba34358faa95de87d2d63993493641c052700c7013] (manager.go:220:0s) job=58260 project=66 runner=WMzkDr33
```
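For anyone triaging similar errors: the volume name in that message encodes which runner, project, and concurrency slot created it. A quick decode, with the field layout inferred from the log line above:

```shell
# Split the volume name on "-"; field layout inferred from the log line.
vol=runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
IFS=- read -r _ runner _ project _ slot _ hash <<<"$vol"
echo "runner=$runner project=$project slot=$slot cache-key=$hash"
```

This makes it easy to map a stuck volume back to the runner and project that produced it.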
Environment description
We are using a custom GitLab Runner installation (version 13.4.1) with the Docker-in-Docker (DinD) executor.
`docker info`:

```
Client:
 Debug Mode: false

Server:
 Containers: 63
  Running: 3
  Paused: 0
  Stopped: 60
 Images: 70
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 35bd7a5f69c13e1563af8a93431411cd9ecf5021
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.19.0-6-amd64
 Operating System: Debian GNU/Linux 10 (buster)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.73GiB
 Name: eu-gr2
 ID: TI6D:J2ZY:7MYY:YPZS:P6SW:ZUBT:TKFF:UZ44:3YDD:XOZI:SS27:HFGB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
`config.toml`:

```toml
concurrent = 8
check_interval = 0

[session_server]
  listen_address = "0.0.0.0:33338"
  advertise_address = "XXXXX"
  session_timeout = 1800

[[runners]]
  name = "eu-gr2"
  url = "XXXXX"
  token = "XXXXX"
  executor = "docker"
  environment = ["DOCKER_TLS_CERTDIR="]
  [runners.docker]
    tls_verify = false
    image = "docker:stable"
    privileged = true
    disable_cache = false
    volumes = ["/cache", "/certs/client"]
    environment = ["DOCKER_DRIVER=overlay2"]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
```