Runner starts to execute scripts before client certs are generated (unable to resolve docker endpoint: open /certs/client/ca.pem)
Summary
We just updated our runners to use TLS for DIND. Now we have a very high failure rate on our CI jobs, because the docker client certificates have not finished generating by the time the runner starts executing the script section of the job.
Steps to reproduce
- Use DIND service v19 or v20.
- Set up CI variables to enable TLS on a Kubernetes runner.
- Create a trivial job that runs a docker command in the script section.
.gitlab-ci.yml
image: $REPO_URL/stage
services:
  - docker:dind
variables:
  DOCKER_HOST: tcp://localhost:2376
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
stages:
  - test
test:
  stage: test
  tags:
    - kube-exec
  script:
    - docker info
Actual behavior
There is at least a 75% chance that the job will fail with: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Expected behavior
The job should execute the docker command and complete successfully.
Relevant logs and/or screenshots
job log
Running with gitlab-runner 13.7.0 (943fc252)
on gitlab-runner-small-new-gitlab-runner-57496854f7-75972 1HpsWcrw
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image $REPO_URL/stage ...
Preparing environment
00:04
Waiting for pod gitlab/runner-1hpswcrw-project-209-concurrent-06bpt4 to be running, status is Pending
Running on runner-1hpswcrw-project-209-concurrent-06bpt4 via gitlab-runner-small-new-gitlab-runner-57496854f7-75972...
Getting source from Git repository
00:00
Fetching changes...
Initialized empty Git repository in /builds/externalci/ci_testing/.git/
Created fresh repository.
Checking out a53a00c7 as master...
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:00
$ docker info
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cleaning up file based variables
00:01
ERROR: Job failed: command terminated with exit code 1
I've found that by putting a sleep of 2-3 seconds before the first docker command, there is enough time for the docker client certs to be generated and then my CI jobs work correctly.
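A less timing-dependent variant of this workaround would be to poll for the certificate files instead of sleeping for a fixed time. The following is only a sketch under assumptions: wait_for_docker_certs is a hypothetical helper name, the default directory is the DOCKER_CERT_PATH from the job above, and the 30-second default timeout is arbitrary. It checks only that ca.pem, cert.pem, and key.pem exist, not that they have been completely written.

```shell
# Hypothetical helper: block until the docker client certificates exist.
# $1 = certificate directory (default: /certs/client)
# $2 = timeout in seconds (default: 30, an arbitrary choice)
wait_for_docker_certs() {
  cert_dir="${1:-/certs/client}"
  timeout="${2:-30}"
  elapsed=0

  # key.pem shows up before ca.pem and cert.pem in the log above,
  # so all three files are checked rather than just one.
  while [ ! -f "$cert_dir/ca.pem" ] ||
        [ ! -f "$cert_dir/cert.pem" ] ||
        [ ! -f "$cert_dir/key.pem" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for certs in $cert_dir" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

In a job, the function could be defined in before_script and called as wait_for_docker_certs "$DOCKER_CERT_PATH" 30 ahead of the first docker command. It avoids `local` and other non-POSIX features so that it should also run under the busybox shell of Alpine-based images.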
job log when using sleeps
Running with gitlab-runner 13.7.0 (943fc252)
on gitlab-runner-small-new-gitlab-runner-65d67fcb9c-p2l5q -VdBG-QW
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image $REPO_URL/stage ...
Preparing environment
00:03
Waiting for pod gitlab/runner--vdbg-qw-project-209-concurrent-0pvl9b to be running, status is Pending
Running on runner--vdbg-qw-project-209-concurrent-0pvl9b via gitlab-runner-small-new-gitlab-runner-65d67fcb9c-p2l5q...
Getting source from Git repository
00:01
Fetching changes...
Initialized empty Git repository in /builds/externalci/ci_testing/.git/
Created fresh repository.
Checking out 8fdd7342 as test/docker...
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:06
$ ls -R /certs/client
/certs/client:
key.pem
$ sleep 1
$ ls -R /certs/client
/certs/client:
key.pem
$ sleep 1
$ ls -R /certs/client
/certs/client:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf
$ sleep 1
$ ls -R /certs/client
/certs/client:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf
$ docker -D info
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Client:
Debug Mode: true
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 20.10.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.203-156.332.amzn2.x86_64
Operating System: Alpine Linux v3.12 (containerized)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 30.8GiB
Name: runner--vdbg-qw-project-209-concurrent-0pvl9b
ID: 7YLF:ZLXD:XAPG:KZDE:7J25:WSIS:4UT7:4RWL:LDXV:UUUO:R5GL:F3VX
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
Cleaning up file based variables
00:00
Job succeeded
Environment description
We run self-hosted GitLab 13.7.1-ee in AWS. For our runners, we use the Kubernetes executor, deployed with the Helm chart, to run our jobs in Kubernetes. Most of our CI jobs run docker commands, so we have the DIND service on all jobs. We recently updated to the latest Helm chart (v0.24.0) and the new values.yaml format. At the same time, we enabled TLS (migrating from Docker v18.x to Docker v20.x).
Here is the runners: section of values.yaml (config.toml contents):
runners:
  config: |
    [[runners]]
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
      [runners.kubernetes]
        image = "docker:latest"
        cpu_request = "100m"
        memory_request = "512Mi"
        service_cpu_request = "200m"
        service_memory_request = "512Mi"
        helper_cpu_request = "100m"
        helper_memory_request = "128Mi"
        poll_timeout = 180
        privileged = true
      [runners.kubernetes.volumes]
        [[runners.kubernetes.volumes.empty_dir]]
          name = "docker-certs"
          mount_path = "/certs/client"
          medium = "Memory"
Used GitLab Runner version
13.7.0 deployed via kubernetes using helm chart v0.24.0.
Possible fixes
There needs to be a way to ensure that the docker client certificates have finished generating BEFORE any of the job scripts start (i.e. prior to starting the before_script or script section of the job). I've simulated this with the sleep N shell command.
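Until something like that exists in the runner, a job-side stopgap could retry the first docker command rather than sleep for a fixed N seconds. This is a rough sketch, not a runner feature: retry is a hypothetical helper name, and the one-second pause between attempts is an arbitrary assumption.

```shell
# Hypothetical helper: run a command until it succeeds, at most $1 times,
# pausing one second between attempts.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
  max_attempts="$1"
  shift
  attempt=1

  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
}
```

In before_script this might be called as retry 30 docker info, which relies on docker info exiting non-zero while the certificates are missing, as it does in the failing job log above. Unlike a fixed sleep, it returns as soon as the command succeeds and fails the job with a clear message if it never does.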