Job failing with "set volume permissions"
## Summary

We have some GitLab Runners using the docker-autoscaler executor and are facing sporadic issues when jobs start. Some of the jobs fail with:

```
ERROR: Preparation failed: creating cache volume: set volume permissions: create permission container for volume "runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93": Post "http://internal.tunnel.invalid/v1.44/containers/create?name=runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93-set-permission-91da54d9bdf65b00": net/http: timeout awaiting response headers (linux_set.go:95:120s)
```

others with:

```
ERROR: Failed to cleanup volumes
ERROR: Job failed (system failure): Post "http://internal.tunnel.invalid/v1.44/containers/create?name=runner-7czrvakmf-project-597-concurrent-52-53ffe5f61d1fa794-build": net/http: timeout awaiting response headers (docker.go:687:120s)
```

The issue does not occur consistently across all jobs or instances (we are on AWS EC2), but it keeps happening intermittently.

Regarding versions, we are running GitLab Runner and helper version 17.8.2. For the operating system and Docker Engine, we have tested:

- Ubuntu 20.04 with Docker 25.0.5
- Ubuntu 24.04 with Docker 27.4.1

On the Docker Engine side, even in debug mode, we have not found any errors, which leads us to believe that Docker is not the root cause of this issue. As for instance resources, we are using m7i.8xlarge instances, which have ample CPU and memory, along with 1,000 TB of storage, so resource constraints do not appear to be a factor.

Our latest discovery is that the container responsible for setting volume permissions at the start of each job does not seem to run when this error occurs:
```
05d198d63adf   a8035ead3055   "/usr/bin/dumb-init …"   15 hours ago   Created   runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93-set-permission-91da54d9bdf65b00
```

The container has no logs, but the volume seems to have been created:

```
local     runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93
```

And if we inspect it, we cannot see any difference from the volumes being used by running jobs:

```
root@ip-10-55-21-177:/var/snap/amazon-ssm-agent/9881# docker volume inspect runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93
[
    {
        "CreatedAt": "2025-03-19T02:04:51Z",
        "Driver": "local",
        "Labels": {
            "com.gitlab.gitlab-runner.job.before_sha": "0000000000000000000000000000000000000000",
            "com.gitlab.gitlab-runner.job.id": "51502619",
            "com.gitlab.gitlab-runner.job.ref": "master_i3",
            "com.gitlab.gitlab-runner.job.sha": "34ed1d2ac433a5da09f4eecefe952ac255917902",
            "com.gitlab.gitlab-runner.job.url": "https://MY_GITLAB_URL/group_A/software-update-container-tools/-/jobs/51502619",
            "com.gitlab.gitlab-runner.managed": "true",
            "com.gitlab.gitlab-runner.pipeline.id": "9854959",
            "com.gitlab.gitlab-runner.project.id": "597",
            "com.gitlab.gitlab-runner.runner.id": "t3_Dnh47V",
            "com.gitlab.gitlab-runner.runner.local_id": "14",
            "com.gitlab.gitlab-runner.type": "cache"
        },
        "Mountpoint": "/var/lib/docker/volumes/runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93/_data",
        "Name": "runner-t3dnh47v-project-597-concurrent-14-93c7e6ecb2f2d647-cache-dcf1cb88564cab74bbfd120b64688e93",
        "Options": null,
        "Scope": "local"
    }
]
```

[Zendesk Ticket](https://gitlab.zendesk.com/agent/tickets/615230) - internal only

## Actual behaviour

The majority of jobs run without issues, but a few of them fail with the errors described above.

## Expected behaviour

All jobs run without problems.
## Environment description

Are you using shared runners on GitLab.com? Or is it a custom installation?

We are using a custom installation of GitLab Enterprise.

Which executors are used? Please also provide the versions of related tools.

- Executor: docker-autoscaler
- GitLab Runner versions tested: 17.8.2 and 17.9.0
- OS: Ubuntu 20.04 and 24.04
- Docker Engine: 25.0.5 and 27.4.1

What could the next steps be in terms of debugging?
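As one possible next step, the symptom can be swept for directly on an affected instance: list permission-setting containers stuck in the `Created` state and the runner's cache volumes, and cross-check the two. This is only a hedged sketch; the function name is ours, while the `set-permission` name suffix and the `com.gitlab.gitlab-runner.type=cache` label come from the output shown in the summary.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: find helper containers that were created
# but never started, plus the runner-managed cache volumes they belong to.
check_stuck_permission_containers() {
  if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    echo "permission containers stuck in the Created state:"
    # "Created" means the container never transitioned to "Running" - the
    # symptom observed above.
    docker ps -a --filter status=created --filter name=set-permission \
      --format '{{.ID}}  {{.CreatedAt}}  {{.Names}}'
    echo "runner-managed cache volumes:"
    # This label appears in the `docker volume inspect` output above.
    docker volume ls --filter label=com.gitlab.gitlab-runner.type=cache \
      --format '{{.Driver}}  {{.Name}}'
  else
    echo "docker daemon not reachable; run this on the affected EC2 instance"
  fi
}

check_stuck_permission_containers
```

Running this shortly after a failure, and comparing container creation times against the job log timestamps, would show whether the create request reached the daemon at all before the 120 s client timeout fired.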
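On the runner side, enabling debug logging should record each stage of job preparation (cache volume creation, the permission container, the build container) with timestamps, which may show where the 120 s hang begins. A minimal, hypothetical `config.toml` fragment; `log_level` is a standard global runner setting, and the rest of the existing file stays unchanged:

```toml
# /etc/gitlab-runner/config.toml - global section only; keep existing entries.
# "debug" is verbose: expect large logs while the issue is being reproduced.
log_level = "debug"
```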