Failed to cleanup volumes after upgrade to GitLab Runner 13.4.1
Status update: 2022-01-31
We've just merged !3269, which should fix or mostly fix this.
We create additional docker volumes in a few different contexts:
- When the
- When the
- When cache volumes are disabled (temporary volume)
When the temporary volume feature was added, we relied on volumes being removed at the end of each job if they were marked as temporary.
Unfortunately, the volume names used were consistent across multiple jobs. When an error occurs and a temporary volume cannot be removed, two things can happen:
- A volume with the same name is used in future jobs, but because it already exists, it is simply re-used. This is why data is being retained from previous jobs.
- Jobs will repeatedly show an error that the volume cannot be removed. This is because it's still attached to the first container where volume removal failed.
!3269 now gives the temporary volumes of each job a unique name. This should mean that:
- Data won't be retained between jobs when caching is disabled and when not using
- Jobs won't repeatedly show this error when a volume is stuck being removed.
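The difference between the old and new naming can be illustrated with a small simulation. This is only a sketch, not Runner's actual code: the function names, the MD5-of-path suffix, the random per-job suffix and the in-memory FakeDocker class are all assumptions made for illustration. Only the overall shape of the volume name is taken from the error message reported in this issue.

```python
import hashlib
import uuid


def consistent_volume_name(runner: str, project: int, slot: int, path: str) -> str:
    # Depends only on runner, project, concurrency slot and mount path,
    # so every job scheduled in the same slot gets the same name.
    digest = hashlib.md5(path.encode()).hexdigest()  # assumed hashing scheme
    return f"runner-{runner}-project-{project}-concurrent-{slot}-cache-{digest}"


def per_job_volume_name(runner: str, project: int, slot: int, path: str) -> str:
    # Hypothetical per-job variant: a random suffix keeps names from colliding.
    base = consistent_volume_name(runner, project, slot, path)
    return f"{base}-{uuid.uuid4().hex[:8]}"


class FakeDocker:
    """In-memory stand-in for Docker volumes (illustration only)."""

    def __init__(self) -> None:
        self.volumes: dict = {}

    def create_volume(self, name: str) -> dict:
        # Creating a volume whose name already exists hands back the
        # existing volume, contents included.
        return self.volumes.setdefault(name, {})


daemon = FakeDocker()

# Job 1 writes to its "temporary" volume, then removal fails (volume stays).
job1 = daemon.create_volume(consistent_volume_name("wmzkdr33", 66, 0, "/cache"))
job1["artifact.txt"] = "from job 1"

# Job 2 with the old scheme silently picks up job 1's data.
job2_old = daemon.create_volume(consistent_volume_name("wmzkdr33", 66, 0, "/cache"))

# Job 2 with a per-job name starts from an empty volume.
job2_new = daemon.create_volume(per_job_volume_name("wmzkdr33", 66, 0, "/cache"))
```

With the consistent name, the second job sees the file left behind by the first. With a per-job name, a leftover volume can still exist, but nothing later will attach to it.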
What it doesn't fix are cases where the container and temporary volume could not be removed in the first place. This can happen if Runner is stopped without jobs first being drained, or if Docker encounters an error (which it can if disk I/O is poor and removal takes too long). If a container does have this problem, it will likely have to be removed manually, but it should no longer cause the problems everybody here has been experiencing.
For anybody that has this problem, it's probably worth running docker volume ls and checking which volumes are still attached to stale containers that were unable to be removed. If you've found a unique case where a volume wasn't able to be removed, please let us know.
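One way to do that check without inspecting each volume by hand is sketched below. It shells out to the Docker CLI (docker volume ls and docker ps with its volume and status filters) and reports every volume that is still referenced by a container that is not running. The helper names are made up for this example, and treating the exited, created and dead states as "stale" is an assumption; adjust it to your setup.

```python
import subprocess
from typing import Callable, Dict, List


def docker_cli(*args: str) -> List[str]:
    """Run a docker CLI command and return its non-empty output lines."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def volumes_held_by_stopped_containers(
    run: Callable[..., List[str]] = docker_cli,
) -> Dict[str, List[str]]:
    """Map each volume name to the IDs of non-running containers using it."""
    held: Dict[str, List[str]] = {}
    for volume in run("volume", "ls", "--quiet"):
        containers = run(
            "ps", "--all", "--quiet",
            "--filter", f"volume={volume}",
            # Filters with the same key are OR-ed, different keys are AND-ed.
            "--filter", "status=exited",
            "--filter", "status=created",
            "--filter", "status=dead",
        )
        if containers:
            held[volume] = containers
    return held
```

Calling volumes_held_by_stopped_containers() on the Runner host returns a dictionary of volume names to container IDs. Anything listed there is a candidate for manual cleanup once you have confirmed that no job still needs it.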
If anybody is able to test out the bleeding edge version of Runner (https://docs.gitlab.com/runner/install/bleeding-edge.html) that includes this fix, I'd really appreciate it.
After upgrading to GitLab Runner 13.4.1, we randomly get the following error at the end of pipeline jobs:
ERROR: Failed to cleanup volumes
It seems that the volume cleanup was added in commit 469a185f.
The related error from syslog:
Sep 30 11:57:18 eu-gr2 gitlab-runner: #033[31;1mERROR: Failed to cleanup volumes #033[0;m #033[31;1merror#033[0;m=remove temporary volumes: Error response from daemon: remove runner-wmzkdr33-project-66-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8: volume is in use - [cb41210023c491c8a1777cba34358faa95de87d2d63993493641c052700c7013] (manager.go:220:0s) #033[31;1mjob#033[0;m=58260 #033[31;1mproject#033[0;m=66 #033[31;1mrunner#033[0;m=WMzkDr33
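The syslog line above already names both the stuck volume and the container that still holds it, so those two values can be pulled out programmatically when scanning logs. The following is a rough sketch: the regular expressions are written against the single log line shown here, the function and class names are invented for this example, and the handling of several comma-separated container IDs is an assumption about how Docker formats the list.

```python
import re
from typing import List, NamedTuple, Optional

# syslog renders the ESC byte of ANSI colour codes as the literal text "#033".
ANSI_IN_SYSLOG = re.compile(r"#033\[[0-9;]*m")
VOLUME_IN_USE = re.compile(
    r"remove (?P<volume>\S+): volume is in use - \[(?P<containers>[0-9a-f, ]+)\]"
)


class StuckVolume(NamedTuple):
    volume: str
    containers: List[str]


def parse_cleanup_error(line: str) -> Optional[StuckVolume]:
    """Extract the volume name and holding container IDs, if present."""
    clean = ANSI_IN_SYSLOG.sub("", line)
    match = VOLUME_IN_USE.search(clean)
    if match is None:
        return None
    containers = [
        cid.strip() for cid in match.group("containers").split(",") if cid.strip()
    ]
    return StuckVolume(match.group("volume"), containers)
```

Feeding each gitlab-runner line from syslog through parse_cleanup_error gives the container IDs to inspect or remove, and the volume names that remain attached to them.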
We are using a custom GitLab Runner installation (version 13.4.1) with the Docker (DinD) executor.
Client:
 Debug Mode: false

Server:
 Containers: 63
  Running: 3
  Paused: 0
  Stopped: 60
 Images: 70
 Server Version: 19.03.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 35bd7a5f69c13e1563af8a93431411cd9ecf5021
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.19.0-6-amd64
 Operating System: Debian GNU/Linux 10 (buster)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.73GiB
 Name: eu-gr2
 ID: TI6D:J2ZY:7MYY:YPZS:P6SW:ZUBT:TKFF:UZ44:3YDD:XOZI:SS27:HFGB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
concurrent = 8
check_interval = 0

[session_server]
  listen_address = "0.0.0.0:33338"
  advertise_address = "XXXXX"
  session_timeout = 1800

[[runners]]
  name = "eu-gr2"
  url = "XXXXX"
  token = "XXXXX"
  executor = "docker"
  environment = ["DOCKER_TLS_CERTDIR="]
  [runners.docker]
    tls_verify = false
    image = "docker:stable"
    privileged = true
    disable_cache = false
    volumes = ["/cache", "/certs/client"]
    environment = ["DOCKER_DRIVER=overlay2"]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]