GitLab Runner build failures for Docker deployments (Docker 29)

Summary

I'm using self-hosted GitLab runners that were set up following this documentation: https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/

I noticed today that our deployments fail because the health check for the Docker-in-Docker (dind) service has started failing. This had not been a problem for months and only began a few days ago.

We build Docker images and push them to the GitLab registry. For deployment, we pull them from GitLab and push them to AWS ECR.

Steps to reproduce

Run a job defined like the one in the following .gitlab-ci.yml:

.gitlab-ci.yml
default:
  image: docker:24.0.5-cli
  services:
    - name: docker:24.0.5-dind
      variables:
        HEALTHCHECK_TCP_PORT: "2375"
  before_script:
    - docker info

variables:
  # 1) Name of directory where restore and build objects are stored.
  OBJECTS_DIRECTORY: 'obj'
  # 2) Name of directory used for keeping restored dependencies.
  NUGET_PACKAGES_DIRECTORY: '.nuget'
  # 3) A relative path to the source code from project repository root.
  # NOTE: Please edit this path so it matches the structure of your project!
  SOURCE_CODE_PATH: 'src/*/'
  
  # Docker
  DOCKER_DRIVER: overlay2
  # When using dind service, you must instruct Docker to talk with
  # the daemon started inside of the service. The daemon is available
  # with a network connection instead of the default
  # /var/run/docker.sock socket.
  DOCKER_HOST: tcp://docker:2375
  #
  # The 'docker' hostname is the alias of the service container as described at
  # https://docs.gitlab.com/ee/ci/services/#accessing-the-services.
  #
  # This instructs Docker not to start over TLS.
  DOCKER_TLS_CERTDIR: ""

  # Other variables redacted...
  # ...

deploy:aws:
  stage: deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    # 1) Pull from GitLab, retag, push to ECR
    - docker pull --platform ${SERVER_PLATFORM_AWS} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
    - docker tag $CI_REGISTRY_IMAGE/api:${DOCKER_TAG} $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
    - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ECR_REGISTRY
    - docker push $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
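
To help tell a daemon that is merely slow to start from one that never becomes reachable, a retry loop such as the following could be run before the first `docker` command. This is a minimal diagnostic sketch, not a confirmed fix for this issue; `wait_for` is a hypothetical helper name, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the command quietly; stop as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

In `before_script` it could be called as `wait_for 30 2 docker info`. If that still fails after about a minute, the dind daemon is most likely not reachable at all rather than just late.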

Actual behavior

The job fails with this error:

Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?

Expected behavior

The job should run successfully, as it has for the past several months.

Relevant logs and/or screenshots

job log
Running with gitlab-runner 18.3.0 (9ba718cd)
  on gitlab-aws-autoscaler iEYRzzd8s, system ID: s_e59dad7ef83b

Preparing the "docker+machine" executor
00:03
Using Docker executor with image docker:24.0.5-cli ...
Starting service docker:24.0.5-dind...
Using effective pull policy of [always] for container docker:24.0.5-dind
Pulling docker image docker:24.0.5-dind ...
Using docker image sha256:7015f2c475d511a251955877c2862016a4042512ba625ed905e69202f87e1a21 for docker:24.0.5-dind with digest docker@sha256:3c6e4dca7a63c9a32a4e00da40461ce067f255987ccc9721cf18ffa087bcd1ef ...
Waiting for services to be up and running (timeout 180 seconds)...
*** WARNING: Service runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0 probably didn't start properly.
Health check error:
service "runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0-wait-for-service" health check: exit code 1
Health check container logs:
2025-11-11T07:59:46.462387831Z FATAL: No HOST or PORT found
Service container logs:
2025-11-11T07:59:46.295360051Z time="2025-11-11T07:59:46.295243880Z" level=info msg="Starting up"
2025-11-11T07:59:46.295762312Z time="2025-11-11T07:59:46.295656887Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://0.0.0.0:2375"
2025-11-11T07:59:46.295778187Z time="2025-11-11T07:59:46.295682203Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://0.0.0.0:2375"
*********
Using effective pull policy of [always] for container docker:24.0.5-cli
Pulling docker image docker:24.0.5-cli ...
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...

Preparing environment
00:01
Using effective pull policy of [always] for container sha256:446e9bb1f9f503abc0a8b81b04acbdceca703007eb5bd10f827b0292a88e9787
Running on runner-ieyrzzd8s-project-486-concurrent-0 via runner-ieyrzzd8s-gitlab-docker-machine-1762842181-5ae87b1b...

Getting source from Git repository

and

Executing "step_script" stage of the job script
00:01
Using effective pull policy of [always] for container docker:24.0.5-cli
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...
$ docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
$ eval $(ssh-agent -s)
Agent pid 29
$ ssh-add <(echo "$SSH_PRIVATE_KEY")
Identity added: /dev/fd/64 (redacted@redacted.local)
$ mkdir -p ~/.ssh
$ echo "$SSH_PRIVATE_KEY" >> ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
$ echo "Host $REMOTE_HOST" >> ~/.ssh/config
$ echo "IdentityFile ~/.ssh/id_rsa" >> ~/.ssh/config
$ [[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
$ apk add rsync
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/6) Installing libacl (2.3.1-r3)
(2/6) Installing lz4-libs (1.9.4-r4)
(3/6) Installing popt (1.19-r2)
(4/6) Installing libxxhash (0.8.2-r0)
(5/6) Installing zstd-libs (1.5.5-r4)
(6/6) Installing rsync (3.4.0-r0)
Executing busybox-1.36.1-r2.trigger
OK: 14 MiB in 28 packages
$ docker pull --platform ${SERVER_PLATFORM_DO} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?

Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit code 1
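
The `FATAL: No HOST or PORT found` line suggests the health check could not work out which address and port to probe. One way to see what the containers are given is to list link-style variables from inside the job. The sketch below assumes the health check looks for variables whose names end in `_TCP_ADDR` or `_TCP_PORT`; that is an assumption, not something the log confirms, and `find_tcp_vars` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints those whose
# key ends in _TCP_ADDR or _TCP_PORT. Exits non-zero if none are found
# (this is grep's own exit status).
find_tcp_vars() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*_TCP_(ADDR|PORT)='
}
```

In a job it could be used as `env | find_tcp_vars || echo "no *_TCP_ADDR or *_TCP_PORT variables set"`. Comparing the output on a runner that still works with one that fails would show whether the set of variables changed.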

Environment description

config.toml contents
concurrent = 4
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-aws-autoscaler"
  limit = 4
  url = "https://gitlab.dev.local"
  token = "redacted"
  executor = "docker+machine"
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "redacted"
      AccessKey = "redacted"
      SecretKey = "redacted"
      BucketName = "redacted"
      BucketLocation = "redacted"
  [runners.docker]
    tls_verify = false
    image = "docker:27.4"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = true
    shm_size = 0
    environment = ["LC_ALL=en_US.UTF-8", "TERM=xterm"]
    wait_for_services_timeout = 180
  [runners.machine]
    IdleCount = 0
    IdleTime = 1800
    MaxBuilds = 25
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
      "amazonec2-access-key= redacted",
      "amazonec2-secret-key= redacted",
      "amazonec2-region=eu-central-1",
      "amazonec2-vpc-id=redacted",
      "amazonec2-subnet-id=redacted",
      "amazonec2-use-private-address=true",
      "amazonec2-zone=b",
      "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
      "amazonec2-security-group=docker-machine-scaler",
      "amazonec2-instance-type=m4.xlarge",
      "amazonec2-ami=ami-0faab6bdbac9486fb",
      "amazonec2-root-size=24",
      "amazonec2-request-spot-instance=true",
    ]
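
Nothing in the MachineOptions shown above appears to pin the Docker Engine version on the autoscaled machines, so it may be worth recording which major version the host is actually running. A small sketch, where `version_major` is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: prints the major component of a dotted version
# string, i.e. everything before the first dot.
version_major() {
  printf '%s\n' "${1%%.*}"
}
```

On the EC2 host itself, `version_major "$(docker version --format '{{.Server.Version}}')"` prints the engine's major version. Run inside a job with `DOCKER_HOST=tcp://docker:2375`, the same command reports the dind service's version instead, so the check needs to run on the machine, not in the job.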

Used GitLab Runner version

18.3.0 (9ba718cd), as reported in the first line of the job log above.

Possible fixes

Edited by Ben