Kubernetes executor + dind: build container script starts before the Docker service is fully ready

Summary

When using dind with the Kubernetes executor, the build script sometimes seems to start before the dind service is fully ready. This leads to errors where Docker is not immediately ready and/or the TLS certs auto-generated by dind are not yet shared between the service and build containers.

We strictly followed the official documentation on how to set up dind on the Kubernetes executor.

Steps to reproduce

In general, this issue happens randomly, but mostly in multi-stage pipelines. Each of the staged jobs runs a dind service.

.gitlab-ci.yml
variables:
  IMG_NAME: $CI_REGISTRY_IMAGE
  SHA_NAME: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY

stages:
  - build
  - test
  - build-multiarch
  - multiarch

build-amd64:
  stage: build
  tags:
    - builder-dind
  script:
    - docker build -t $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA ./
    - docker push $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
  except:
    - schedules
    - tags

test-app:
  stage: test
  image: $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
  tags:
    - builder
  before_script: []
  script:
    - python -m pytest -v --cov-config=.coveragerc --cov=lib tests
  coverage: '/TOTAL.*\s+(\d+%)$/'
  variables:
    GIT_STRATEGY: none
  only:
    - branches
    - master
  except:
    - schedules

build-arm64:
  stage: build-multiarch
  tags:
    - builder-dind
  script:
    - docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    - docker build --platform linux/arm64 -t $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA ./
    - docker push $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
  variables:
    DOCKER_BUILDKIT: "0"
  except:
    - schedules
    - tags

multiarch:
  stage: multiarch
  tags:
    - builder-dind
  script:
    - docker pull $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
    - docker pull $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
    - docker manifest create $SHA_NAME --amend $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA --amend $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
    - docker manifest push $SHA_NAME
  variables:
    GIT_STRATEGY: none
  except:
    - schedules
    - tags

Actual behavior

Either the Docker service has not completely started or has not finished auto-generating its local self-signed certs, or Kubernetes is slow to mount the shared cert directory even when Docker is ready. I don't know which.

Running with gitlab-runner 13.6.0 (8fa89735)
  on builder-dind-gitlab-runner-55fc58b5d-cr8tb acGbRrJM
Preparing the "kubernetes" executor
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image docker:stable-dind ...
Using attach strategy to execute scripts...
Preparing environment
Waiting for pod gitlab/runner-acgbrrjm-project-19995846-concurrent-09hd2z to be running, status is Pending
Running on runner-acgbrrjm-project-19995846-concurrent-09hd2z via builder-dind-gitlab-runner-55fc58b5d-cr8tb...
Getting source from Git repository
Skipping Git repository setup
Skipping Git checkout
Skipping Git submodules setup
Executing "step_script" stage of the job script
$ docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cleaning up file based variables
ERROR: Job failed: command terminated with exit code 1
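
To narrow down which of the two explanations above applies, one option is to check at job start whether the client cert files exist yet. The sketch below is only an illustration: `check_docker_certs` is a hypothetical helper name, not something provided by gitlab-runner or Docker, and it assumes the client certs live in `DOCKER_CERT_PATH` (falling back to `/certs/client` as in the runner configuration). If the files are present but `docker info` still fails, the daemon itself is probably not listening yet.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of gitlab-runner or Docker).
# Reports which client TLS files are missing from the cert directory.
# Usage: check_docker_certs [cert_dir]
check_docker_certs() {
  cert_dir="${1:-${DOCKER_CERT_PATH:-/certs/client}}"
  missing=0
  # ca.pem, cert.pem and key.pem are the files the docker CLI loads for TLS.
  for f in ca.pem cert.pem key.pem; do
    if [ ! -s "$cert_dir/$f" ]; then
      echo "missing or empty: $cert_dir/$f"
      missing=$(( missing + 1 ))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "all client certs present in $cert_dir"
    return 0
  fi
  return 1
}
```

Calling `check_docker_certs` as the first line of the job script would print the state of the cert directory at the moment the script starts.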

Expected behavior

Docker commands should complete without any issue immediately after the builder container and the job script start.

Environment description

My runners run in a dedicated GKE cluster (1.17.13). Nothing special: I just install the Helm chart in it and it works like a charm.

builder-dind runner configuration:

[[runners]]
  executor = "kubernetes"
  environment = [
    "FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false",
    "DOCKER_HOST=tcp://localhost:2376",
    "DOCKER_TLS_VERIFY=1",
    "DOCKER_TLS_CERTDIR=/certs",
    "DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client",
    "DOCKER_BUILDKIT=1",
    "DOCKER_CLI_EXPERIMENTAL=enabled"
  ]
  [runners.kubernetes]
    image = "docker:stable-dind"
    privileged = true
    cpu_limit = "200m"
    cpu_request = "100m"
    memory_limit = "256Mi"
    memory_request = "128Mi"
    service_cpu_limit = "3000m"
    service_cpu_request = "2000m"
    service_memory_limit = "6Gi"
    service_memory_request = "4Gi"
    helper_cpu_limit = "100m"
    helper_cpu_request = "25m"
    helper_memory_limit = "128Mi"
    helper_memory_request = "64Mi"
    [[runners.kubernetes.services]]
      name = "docker:stable-dind"
      command = ["--experimental", "--registry-mirror", "https://mirror.gcr.io"]
      alias = "docker"
    [[runners.kubernetes.volumes.empty_dir]]
      name = "docker-certs"
      mount_path = "/certs/client"
      medium = "Memory"

Used GitLab Runner version

GitLab Runner version 13.6.0, running with gitlab-runner Helm chart 0.23.0 and the following values:

gitlabUrl: https://gitlab.com/
runnerRegistrationToken: <MY_SECRET_TOKEN>
unregisterRunners: true
concurrent: 30
checkInterval: 20
rbac:
  create: true
resources:
  limits:
    cpu: 200m
    memory: 256Mi
  requests:
    cpu: 50m
    memory: 128Mi

Possible fixes

I managed to "wait" for Docker and the cert files to be ready inside my builder job with a MEGA-WEIRD sh trick:

before_script:
  # sometimes the dind service is slow to start and/or the dind certs are slow to become available in the build container
  - i=0; while [ "$i" -lt 12 ]; do docker info && break; sleep 5; i=$(( i + 1 )) ; done
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY

This seems to avoid crashing the whole job by waiting for docker-in-docker to initialize correctly:

[...]
Getting source from Git repository
Skipping Git repository setup
Skipping Git checkout
Skipping Git submodules setup
Executing "step_script" stage of the job script
$ i=0; while [ "$i" -lt 12 ]; do docker info && break; sleep 5; i=$(( i + 1 )) ; done
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Client:
 Debug Mode: false
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.14
[...]
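
One limitation of the loop above is that it falls through silently if Docker never becomes ready, so the job only fails later at `docker login`. A possible variant that fails explicitly is sketched below. The names `wait_for_docker` and `DOCKER_PROBE` are hypothetical and exist only for this illustration; the probe defaults to `docker info`, as in the original trick.

```shell
#!/bin/sh
# Hypothetical helper (not part of gitlab-runner or Docker).
# Retries a probe command until it succeeds or the attempt budget is spent.
# Usage: wait_for_docker [max_attempts] [delay_seconds]
# DOCKER_PROBE overrides the probe command; it defaults to "docker info".
wait_for_docker() {
  max_attempts="${1:-12}"
  delay="${2:-5}"
  attempt=1
  while :; do
    # The probe succeeds only once the client finds its certs and reaches the daemon.
    if ${DOCKER_PROBE:-docker info} >/dev/null 2>&1; then
      echo "docker ready after $attempt attempt(s)"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      break
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  echo "docker not ready after $max_attempts attempts" >&2
  return 1
}
```

Called as `wait_for_docker 12 5`, this keeps roughly the same retry budget as the original loop (12 attempts, 5 seconds apart), but returns a non-zero status when the budget is exhausted so the job stops with a clear message.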

I'd rather not do this! IMO, the gitlab-runner Kubernetes executor should provide a way to "wait" for a service to be ready, using any appropriate strategy (delay, service TCP check, service health check, check script, ...).

Any help on this issue is appreciated :) Thanks a lot :)