GitLab Runner build failures for Docker deployments (Docker 29)
Summary
I'm using self-hosted GitLab runners that were set up using this documentation: https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/
I noticed today that our deployments fail because the health check for the DinD service has suddenly started failing. This had not been a problem for months, but it started to occur a few days ago.
Basically, we build Docker images and push them to the GitLab registry; for deployment, we pull them from GitLab and push them to AWS ECR.
Steps to reproduce
Just try to execute a job that's defined like the job in the following details section:
.gitlab-ci.yml
default:
  image: docker:24.0.5-cli
  services:
    - name: docker:24.0.5-dind
      variables:
        HEALTHCHECK_TCP_PORT: "2375"
  before_script:
    - docker info
variables:
  # 1) Name of directory where restore and build objects are stored.
  OBJECTS_DIRECTORY: 'obj'
  # 2) Name of directory used for keeping restored dependencies.
  NUGET_PACKAGES_DIRECTORY: '.nuget'
  # 3) A relative path to the source code from project repository root.
  # NOTE: Please edit this path so it matches the structure of your project!
  SOURCE_CODE_PATH: 'src/*/'
  # Docker
  DOCKER_DRIVER: overlay2
  # When using dind service, you must instruct Docker to talk with
  # the daemon started inside of the service. The daemon is available
  # with a network connection instead of the default
  # /var/run/docker.sock socket.
  DOCKER_HOST: tcp://docker:2375
  #
  # The 'docker' hostname is the alias of the service container as described at
  # https://docs.gitlab.com/ee/ci/services/#accessing-the-services.
  #
  # This instructs Docker not to start over TLS.
  DOCKER_TLS_CERTDIR: ""
  # Other variables redacted...
  # ...
deploy:aws:
  stage: deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    # 1) Pull from GitLab, retag, push to ECR
    - docker pull --platform ${SERVER_PLATFORM_AWS} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
    - docker tag $CI_REGISTRY_IMAGE/api:${DOCKER_TAG} $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
    - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ECR_REGISTRY
    - docker push $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
Actual behavior
Job fails with error:
Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
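To tell a daemon that is merely slow to start from one that is never reachable, the `docker info` call in `before_script` could be wrapped in a retry loop. The following is only a sketch: the `wait_for` helper, its arguments, and the attempt and delay values are hypothetical and not part of the original pipeline.

```shell
# Hypothetical helper: run a probe command until it succeeds or the
# attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe silently; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s): $*" >&2
  return 1
}

# In the job this might be called as (values are assumptions):
#   wait_for 30 2 docker info
```

If the loop still fails after every attempt, the daemon is most likely not reachable at `DOCKER_HOST` at all, rather than just starting late.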
Expected behavior
The job should run successfully, as it has for the past months.
Relevant logs and/or screenshots
job log
Running with gitlab-runner 18.3.0 (9ba718cd)
on gitlab-aws-autoscaler iEYRzzd8s, system ID: s_e59dad7ef83b
Preparing the "docker+machine" executor
00:03
Using Docker executor with image docker:24.0.5-cli ...
Starting service docker:24.0.5-dind...
Using effective pull policy of [always] for container docker:24.0.5-dind
Pulling docker image docker:24.0.5-dind ...
Using docker image sha256:7015f2c475d511a251955877c2862016a4042512ba625ed905e69202f87e1a21 for docker:24.0.5-dind with digest docker@sha256:3c6e4dca7a63c9a32a4e00da40461ce067f255987ccc9721cf18ffa087bcd1ef ...
Waiting for services to be up and running (timeout 180 seconds)...
*** WARNING: Service runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0 probably didn't start properly.
Health check error:
service "runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0-wait-for-service" health check: exit code 1
Health check container logs:
2025-11-11T07:59:46.462387831Z FATAL: No HOST or PORT found
Service container logs:
2025-11-11T07:59:46.295360051Z time="2025-11-11T07:59:46.295243880Z" level=info msg="Starting up"
2025-11-11T07:59:46.295762312Z time="2025-11-11T07:59:46.295656887Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://0.0.0.0:2375"
2025-11-11T07:59:46.295778187Z time="2025-11-11T07:59:46.295682203Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://0.0.0.0:2375"
*********
Using effective pull policy of [always] for container docker:24.0.5-cli
Pulling docker image docker:24.0.5-cli ...
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...
Preparing environment
00:01
Using effective pull policy of [always] for container sha256:446e9bb1f9f503abc0a8b81b04acbdceca703007eb5bd10f827b0292a88e9787
Running on runner-ieyrzzd8s-project-486-concurrent-0 via runner-ieyrzzd8s-gitlab-docker-machine-1762842181-5ae87b1b...
Getting source from Git repository
and
Executing "step_script" stage of the job script
00:01
Using effective pull policy of [always] for container docker:24.0.5-cli
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...
$ docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
$ eval $(ssh-agent -s)
Agent pid 29
$ ssh-add <(echo "$SSH_PRIVATE_KEY")
Identity added: /dev/fd/64 (redacted@redacted.local)
$ mkdir -p ~/.ssh
$ echo "$SSH_PRIVATE_KEY" >> ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
$ echo "Host $REMOTE_HOST" >> ~/.ssh/config
$ echo "IdentityFile ~/.ssh/id_rsa" >> ~/.ssh/config
$ [[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
$ apk add rsync
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/6) Installing libacl (2.3.1-r3)
(2/6) Installing lz4-libs (1.9.4-r4)
(3/6) Installing popt (1.19-r2)
(4/6) Installing libxxhash (0.8.2-r0)
(5/6) Installing zstd-libs (1.5.5-r4)
(6/6) Installing rsync (3.4.0-r0)
Executing busybox-1.36.1-r2.trigger
OK: 14 MiB in 28 packages
$ docker pull --platform ${SERVER_PLATFORM_DO} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit code 1
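The health check container exits with "No HOST or PORT found", which suggests it never got as far as probing the daemon. As a manual stand-in, the job itself could split `DOCKER_HOST` into host and port and probe that endpoint directly. This is a minimal sketch: the `split_tcp_endpoint` helper is hypothetical, and the `nc` flags in the usage comment are an assumption about the tools available in the job image.

```shell
# Hypothetical helper: split a tcp://HOST:PORT value into "HOST PORT".
split_tcp_endpoint() {
  endpoint=${1#tcp://}   # drop the scheme prefix
  host=${endpoint%%:*}   # text before the first colon
  port=${endpoint##*:}   # text after the last colon
  echo "$host $port"
}

# Example with the DOCKER_HOST value from .gitlab-ci.yml:
split_tcp_endpoint "tcp://docker:2375"   # prints: docker 2375

# In the job, the two parts could feed a plain TCP probe, for example
# (assuming the image ships an nc that supports -z and -w):
#   set -- $(split_tcp_endpoint "$DOCKER_HOST")
#   nc -z -w 3 "$1" "$2" && echo "port open" || echo "unreachable"
```

A probe like this would help separate a name resolution problem for the `docker` alias from a daemon that is not listening on the port.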
Environment description
config.toml contents
concurrent = 4
check_interval = 0
[session_server]
  session_timeout = 1800
[[runners]]
  name = "gitlab-aws-autoscaler"
  limit = 4
  url = "https://gitlab.dev.local"
  token = "redacted"
  executor = "docker+machine"
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "redacted"
      AccessKey = "redacted"
      SecretKey = "redacted"
      BucketName = "redacted"
      BucketLocation = "redacted"
  [runners.docker]
    tls_verify = false
    image = "docker:27.4"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = true
    shm_size = 0
    environment = ["LC_ALL=en_US.UTF-8", "TERM=xterm"]
    wait_for_services_timeout = 180
  [runners.machine]
    IdleCount = 0
    IdleTime = 1800
    MaxBuilds = 25
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
      "amazonec2-access-key= redacted",
      "amazonec2-secret-key= redacted",
      "amazonec2-region=eu-central-1",
      "amazonec2-vpc-id=redacted",
      "amazonec2-subnet-id=redacted",
      "amazonec2-use-private-address=true",
      "amazonec2-zone=b",
      "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
      "amazonec2-security-group=docker-machine-scaler",
      "amazonec2-instance-type=m4.xlarge",
      "amazonec2-ami=ami-0faab6bdbac9486fb",
      "amazonec2-root-size=24",
      "amazonec2-request-spot-instance=true",
    ]
Used GitLab Runner version
Possible fixes