Kubernetes executor + dind: build container script starts before the docker service is fully ready
Summary
When using dind with the Kubernetes executor, the build script sometimes seems to start before the dind service is fully ready. This leads to errors where docker is not immediately ready and/or the dind auto-generated TLS certs are not immediately shared between the service and build containers.
We strictly followed the official documentation on how to set up dind with the Kubernetes executor:
- https://docs.gitlab.com/runner/executors/kubernetes.html#using-dockerdind
- https://docs.gitlab.com/ee/ci/docker/using_docker_build.html#kubernetes
Steps to reproduce
This issue happens randomly, but mostly in multi-stage pipelines. Each of the staged jobs runs a dind service.
variables:
  IMG_NAME: $CI_REGISTRY_IMAGE
  SHA_NAME: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
before_script:
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
stages:
  - build
  - test
  - build-multiarch
  - multiarch
build-amd64:
  stage: build
  tags:
    - builder-dind
  script:
    - docker build -t $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA ./
    - docker push $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
  except:
    - schedules
    - tags
test-app:
  stage: test
  image: $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
  tags:
    - builder
  before_script: []
  script:
    - python -m pytest -v --cov-config=.coveragerc --cov=lib tests
  coverage: '/TOTAL.*\s+(\d+%)$/'
  variables:
    GIT_STRATEGY: none
  only:
    - branches
    - master
  except:
    - schedules
build-arm64:
  stage: build-multiarch
  tags:
    - builder-dind
  script:
    - docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    - docker build --platform linux/arm64 -t $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA ./
    - docker push $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
  variables:
    DOCKER_BUILDKIT: "0"
  except:
    - schedules
    - tags
multiarch:
  stage: multiarch
  tags:
    - builder-dind
  script:
    - docker pull $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA
    - docker pull $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
    - docker manifest create $SHA_NAME --amend $IMG_NAME/amd64:$CI_COMMIT_SHORT_SHA --amend $IMG_NAME/arm64:$CI_COMMIT_SHORT_SHA
    - docker manifest push $SHA_NAME
  variables:
    GIT_STRATEGY: none
  except:
    - schedules
    - tags
Actual behavior
Either the docker service has not completely started, or it has not finished generating its local self-signed certs, or Kubernetes is slow to mount the shared cert directory even though docker is ready. I don't know which.
Running with gitlab-runner 13.6.0 (8fa89735)
on builder-dind-gitlab-runner-55fc58b5d-cr8tb acGbRrJM
Preparing the "kubernetes" executor
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image docker:stable-dind ...
Using attach strategy to execute scripts...
Preparing environment
Waiting for pod gitlab/runner-acgbrrjm-project-19995846-concurrent-09hd2z to be running, status is Pending
Running on runner-acgbrrjm-project-19995846-concurrent-09hd2z via builder-dind-gitlab-runner-55fc58b5d-cr8tb...
Getting source from Git repository
Skipping Git repository setup
Skipping Git checkout
Skipping Git submodules setup
Executing "step_script" stage of the job script
$ docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cleaning up file based variables
ERROR: Job failed: command terminated with exit code 1
Expected behavior
docker commands should complete without any issue immediately after the builder container and the job script start.
Environment description
My runners are running in a dedicated GKE cluster (1.17.13). Nothing special: I just install the helm chart in it and it works like a charm.
builder-dind runner configuration:
[[runners]]
executor = "kubernetes"
environment = [
"FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false",
"DOCKER_HOST=tcp://localhost:2376",
"DOCKER_TLS_VERIFY=1",
"DOCKER_TLS_CERTDIR=/certs",
"DOCKER_CERT_PATH=$DOCKER_TLS_CERTDIR/client",
"DOCKER_BUILDKIT=1",
"DOCKER_CLI_EXPERIMENTAL=enabled"
]
[runners.kubernetes]
image = "docker:stable-dind"
privileged = true
cpu_limit = "200m"
cpu_request = "100m"
memory_limit = "256Mi"
memory_request = "128Mi"
service_cpu_limit = "3000m"
service_cpu_request = "2000m"
service_memory_limit = "6Gi"
service_memory_request = "4Gi"
helper_cpu_limit = "100m"
helper_cpu_request = "25m"
helper_memory_limit = "128Mi"
helper_memory_request = "64Mi"
[[runners.kubernetes.services]]
name = "docker:stable-dind"
command = ["--experimental", "--registry-mirror", "https://mirror.gcr.io"]
alias = "docker"
[[runners.kubernetes.volumes.empty_dir]]
name = "docker-certs"
mount_path = "/certs/client"
medium = "Memory"
Used GitLab Runner version
GitLab Runner version 13.6.0
Running with gitlab-runner helm chart 0.23.0 with the following values:
gitlabUrl: https://gitlab.com/
runnerRegistrationToken: <MY_SECRET_TOKEN>
unregisterRunners: true
concurrent: 30
checkInterval: 20
rbac:
  create: true
resources:
  limits:
    cpu: 200m
    memory: 256Mi
  requests:
    cpu: 50m
    memory: 128Mi
Possible fixes
I managed to "wait" for docker and the cert files to be OK inside my builder job with a MEGA-WEIRD sh trick:
before_script:
  # sometimes the dind svc seems slow to start and/or the dind certs are slow to become available on the builder
  - i=0; while [ "$i" -lt 12 ]; do docker info && break; sleep 5; i=$(( i + 1 )) ; done
  - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
This seems to avoid crashing the whole job by waiting for docker-in-docker to initialize correctly:
[...]
Getting source from Git repository
Skipping Git repository setup
Skipping Git checkout
Skipping Git submodules setup
Executing "step_script" stage of the job script
$ i=0; while [ "$i" -lt 12 ]; do docker info && break; sleep 5; i=$(( i + 1 )) ; done
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Client:
Debug Mode: false
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 19.03.14
[...]
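For anyone needing the same workaround, here is a slightly more explicit sketch of the wait in plain POSIX sh. It is only an illustration, not something the runner provides: the `wait_for` function name and the attempt/delay values are my own choices. Unlike the one-line loop, it returns a non-zero status when the probe never succeeds, so the job can fail with a clear cause instead of carrying on to `docker login`.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_for() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Probe output is discarded so the job log is not flooded with errors.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  return 1
}
```

In a job this could be used as `wait_for 12 5 test -f "$DOCKER_CERT_PATH/ca.pem"` followed by `wait_for 12 5 docker info`, each ending with `|| exit 1`, assuming `DOCKER_CERT_PATH` is set as in the runner configuration above. Checking the cert file first separates the "certs not shared yet" case from the "daemon not ready yet" case.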
I'd rather not do this! IMO, the gitlab-runner Kubernetes executor should provide a way to "wait" for a service to be ready, using any appropriate strategy (delay, svc TCP check, svc health check, check script, ...).
Any help appreciated on this issue :) ! Thanks a lot :)