Runner starts to execute scripts before client certs are generated (unable to resolve docker endpoint: open /certs/client/ca.pem)

Summary

We recently updated our runners to use TLS for Docker-in-Docker (DIND). Since then, a large share of our CI jobs fail because the script section of the job starts executing before the docker client certificates have been generated.

Steps to reproduce

Use DIND service v19 or v20.
Set up CI variables to enable TLS on a Kubernetes runner.
Create a trivial job that runs a docker command in the script section.

.gitlab-ci.yml

image: $REPO_URL/stage

services:
  - docker:dind

variables:
  DOCKER_HOST: tcp://localhost:2376
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"

stages:
  - test

test:
  stage: test
  tags:
    - kube-exec
  script:
    - docker info

Actual behavior

The job fails at least 75% of the time with the error: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory.

Expected behavior

The job should execute the docker command and complete successfully.

Relevant logs and/or screenshots

job log

Running with gitlab-runner 13.7.0 (943fc252)
  on gitlab-runner-small-new-gitlab-runner-57496854f7-75972 1HpsWcrw
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image $REPO_URL/stage ...
Preparing environment
00:04
Waiting for pod gitlab/runner-1hpswcrw-project-209-concurrent-06bpt4 to be running, status is Pending
Running on runner-1hpswcrw-project-209-concurrent-06bpt4 via gitlab-runner-small-new-gitlab-runner-57496854f7-75972...
Getting source from Git repository
00:00
Fetching changes...
Initialized empty Git repository in /builds/externalci/ci_testing/.git/
Created fresh repository.
Checking out a53a00c7 as master...
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:00
$ docker info
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cleaning up file based variables
00:01
ERROR: Job failed: command terminated with exit code 1

I've found that adding a sleep of 2-3 seconds before the first docker command gives the docker client certs enough time to be generated, and my CI jobs then work correctly.
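A fixed sleep is a guess at how long certificate generation takes. A polling loop is a possible alternative: a minimal sketch, assuming the docker CLI needs ca.pem, cert.pem and key.pem in DOCKER_CERT_PATH. The function name wait_for_docker_certs and the 30-second default timeout are hypothetical, not part of GitLab Runner or Docker. Checking key.pem alone would not be enough, since the log below shows it appearing before the other files.

```shell
# Hypothetical helper (not part of GitLab Runner or Docker): wait until the
# DIND client certificates exist and are non-empty, or give up after a timeout.
# Usage: wait_for_docker_certs [cert_dir] [timeout_seconds]
wait_for_docker_certs() {
  cert_dir="${1:-${DOCKER_CERT_PATH:-/certs/client}}"
  timeout="${2:-30}"   # assumed default, tune for your cluster
  elapsed=0
  # All three files must be present; key.pem shows up before ca.pem and cert.pem.
  while [ ! -s "$cert_dir/ca.pem" ] || [ ! -s "$cert_dir/cert.pem" ] || [ ! -s "$cert_dir/key.pem" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for certificates in $cert_dir" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example use at the top of a job script, before the first docker command:
#   wait_for_docker_certs /certs/client 30 && docker info
```

This only checks that the files are non-empty, so it can still race with a file that is being written. It narrows the window but is a workaround, not a fix for the ordering problem itself.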

job log when using sleeps

Running with gitlab-runner 13.7.0 (943fc252)
  on gitlab-runner-small-new-gitlab-runner-65d67fcb9c-p2l5q -VdBG-QW
Resolving secrets
00:00
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image $REPO_URL/stage ...
Preparing environment
00:03
Waiting for pod gitlab/runner--vdbg-qw-project-209-concurrent-0pvl9b to be running, status is Pending
Running on runner--vdbg-qw-project-209-concurrent-0pvl9b via gitlab-runner-small-new-gitlab-runner-65d67fcb9c-p2l5q...
Getting source from Git repository
00:01
Fetching changes...
Initialized empty Git repository in /builds/externalci/ci_testing/.git/
Created fresh repository.
Checking out 8fdd7342 as test/docker...
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:06
$ ls -R /certs/client
/certs/client:
key.pem
$ sleep 1
$ ls -R /certs/client
/certs/client:
key.pem
$ sleep 1
$ ls -R /certs/client
/certs/client:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf
$ sleep 1
$ ls -R /certs/client
/certs/client:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf
$ docker -D info
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Client:
 Debug Mode: true
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.203-156.332.amzn2.x86_64
 Operating System: Alpine Linux v3.12 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.8GiB
 Name: runner--vdbg-qw-project-209-concurrent-0pvl9b
 ID: 7YLF:ZLXD:XAPG:KZDE:7J25:WSIS:4UT7:4RWL:LDXV:UUUO:R5GL:F3VX
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine
Cleaning up file based variables
00:00
Job succeeded

Environment description

We run self-hosted GitLab 13.7.1-ee in AWS. Our runners use the Kubernetes executor, deployed with the Helm chart, to run our jobs in Kubernetes. Most of our CI jobs run docker commands, so the DIND service is attached to all jobs. We recently updated to the latest Helm chart (v0.24.0) and the new values.yaml format. At the same time, we enabled TLS as part of migrating from Docker v18.x to Docker v20.x.

Here is the runners: section of values.yaml

config.toml contents

runners:
  config: |

    [[runners]]
      environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=1"]
      [runners.kubernetes]
        image = "docker:latest"
        cpu_request = "100m"
        memory_request = "512Mi"
        service_cpu_request = "200m"
        service_memory_request = "512Mi"
        helper_cpu_request = "100m"
        helper_memory_request = "128Mi"
        poll_timeout = 180
        privileged = true
        [runners.kubernetes.volumes]
          [[runners.kubernetes.volumes.empty_dir]]
            name = "docker-certs"
            mount_path = "/certs/client"
            medium = "Memory"

Used GitLab Runner version

13.7.0, deployed on Kubernetes using Helm chart v0.24.0.

Possible fixes

There needs to be a way to ensure that the docker client certificates have been fully generated BEFORE any of the job scripts start (i.e. prior to starting the before_script or script section of the job). I've simulated this with the sleep N shell command.
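Until the runner can guarantee that ordering, a job can approximate it by retrying the first docker command instead of sleeping for a fixed time. The following is a rough sketch of a bounded retry wrapper. The name retry and its argument order are hypothetical, and it assumes that a failing docker info is safe to repeat.

```shell
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Command failed after $attempts attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: give the DIND service up to roughly 30 seconds to become usable.
#   retry 30 1 docker info
```

Unlike sleep N, this returns as soon as the command succeeds and fails the job with a clear message if the service never becomes ready.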

Edited Dec 30, 2020 by Nick Davis