GitLab Runner random Docker daemon connection errors in multi-runner pattern
We are currently still using the docker-machine executor in the multi-runner pattern. Our multi-runners are configured to manage 4 sizes of ephemeral EC2 runners via docker-machine in the AZ the runner is located in.
For about a month, we have been seeing random errors that we cannot explain, because the job log output does not show where each message originates.
For example, the start of the log shows:
Running with gitlab-runner 14.8.0 (565b6c0b)
on cdp-multirunner-aws-teq-prd-nat-1-east-1c-2-small JyQ43CFq
..
..
Running on runner-jyq43cfq-project-25009910-concurrent-0 via runner-jyq43cfq-nat-1-cdp-gitlab-east-1c-2-s-1645722586-cda8b903...
Then, at any point during the job, it can randomly fail with the following:
WARNING: Failed to inspect build container c40d9804ce97f4d8f2c63a82c517016ec975f896ad2515b9166102477748414a Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running? (docker_command.go:156:0s)
Authenticating with credentials from $DOCKER_AUTH_CONFIG
Pulling docker image golang:1 ...
WARNING: Failed to pull image with policy "always": Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running? (manager.go:203:0s)
Attempt #2: Trying "if-not-present" pull policy
Authenticating with credentials from $DOCKER_AUTH_CONFIG
Pulling docker image golang:1 ...
WARNING: Failed to pull image with policy "if-not-present": Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running? (manager.go:203:0s)
Cleaning up project directory and file based variables 00:00
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-565b6c0b ...
WARNING: Failed to pull image with policy "always": Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running? (manager.go:203:0s)
Attempt #2: Trying "if-not-present" pull policy
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-565b6c0b ...
WARNING: Failed to pull image with policy "if-not-present": Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running? (manager.go:203:0s)
ERROR: Failed to cleanup volumes
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at tcp://100.75.121.26:2376. Is the docker daemon running?
I have traced back through how things are running and cannot tell whether these errors are emitted by the multi-runner or by the ephemeral runner, as they are 2 separate patterns.
Is the error:
Multi-runner -> Ephemeral via the docker-machine port tcp://<ip>:2376
or
Ephemeral -> Itself via tcp://<ip>:2376
Better demarcation of where each log element comes from would greatly help identify where the issue lies.
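In the meantime, one way to narrow down which of the two paths is failing might be to probe the Docker port from each side while a job is running. The following is only a minimal sketch, not anything provided by GitLab Runner: `probe_docker_port` and `watch` are hypothetical helper names, and it only performs a plain TCP connect to port 2376 (no TLS handshake or Docker API call), so it can show reachability but not whether the daemon itself is healthy.

```python
import socket
import time
from typing import List, Tuple


def probe_docker_port(host: str, port: int = 2376, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt a plain TCP connection and report whether it succeeded.

    This does not speak TLS or the Docker API; it only checks that
    something is accepting connections on host:port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


def watch(host: str, port: int = 2376, interval: float = 5.0, attempts: int = 3) -> List[bool]:
    """Probe repeatedly, printing a timestamped line per attempt."""
    results = []
    for i in range(attempts):
        ok, detail = probe_docker_port(host, port)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{stamp} {host}:{port} {'OK' if ok else 'FAIL'} {detail}")
        results.append(ok)
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

For example, calling `watch("100.75.121.26")` on the multi-runner host, and the same against the instance's own address on the ephemeral runner, would give timestamped results to compare with the job log. If the probe fails only from the multi-runner, that would point towards the Multi-runner -> Ephemeral path, although this is an assumption about how the connection is made rather than something the log confirms.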