per-build networking breaks DNS configuration for DinD

Summary

With per-build networking enabled (FF_NETWORK_PER_BUILD), containers started inside Docker-in-Docker (DinD) do not pick up the DNS configuration of the host system. Instead, Docker falls back to its hard-coded DNS servers, 8.8.8.8 and 8.8.4.4.

This is a problem particularly in corporate/institutional networks where outgoing DNS traffic may be blocked, i.e. public DNS resolvers cannot be reached. As a result, any docker build whose RUN commands need to resolve host names fails.

Steps to reproduce

A Docker-only reproduction of the underlying issue, without GitLab Runner, is given at the end of this section.

Create a repository consisting of the following Dockerfile and .gitlab-ci.yml, then run the pipeline on a runner that has per-build networking enabled:

Dockerfile
FROM busybox

RUN cat /etc/resolv.conf
RUN nslookup google.com || true
RUN wget -O /google.com.html https://google.com/
.gitlab-ci.yml
stages:
  - build

build-image:
  stage: build
  image: docker:20.10
  tags:
    - docker-privileged
  services:
    - docker:20.10-dind
  script:
    - docker build -t myimage .

The underlying problem, Docker only

The following example demonstrates the underlying issue by emulating some of the steps performed by GitLab Runner, specifically creating a custom docker network and connecting DinD to it:

# Host resolv.conf, using a company-internal DNS resolver
$ cat /etc/resolv.conf
search corp.com

nameserver 192.168.53.53
# Create per-build network
$ docker network create test
# Start the docker:dind container connected to the network
$ docker run -d --name dind --net test --privileged docker:dind
# DinD resolv.conf, using a forwarding resolver specific to the custom docker network `test`
$ docker exec dind cat /etc/resolv.conf
search corp.com
nameserver 127.0.0.11
options ndots:0
# Name resolution/ping works
$ docker exec dind ping -c 1 google.com
PING google.com (142.250.185.174): 56 data bytes
64 bytes from 142.250.185.174: seq=0 ttl=111 time=10.567 ms

--- google.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 10.567/10.567/10.567 ms
# Container inside of DinD resolv.conf, falling back to hard-coded docker defaults
$ docker exec dind docker run busybox cat /etc/resolv.conf
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
aa2a8d90b84c: Pulling fs layer
aa2a8d90b84c: Download complete
aa2a8d90b84c: Pull complete
Digest: sha256:be4684e4004560b2cd1f12148b7120b0ea69c385bcc9b12a637537a2c60f97fb
Status: Downloaded newer image for busybox:latest
search corp.com
options ndots:0

nameserver 8.8.8.8
nameserver 8.8.4.4
# Name resolution does not work as outgoing DNS traffic is blocked by the corporation's firewall
$ docker exec -it dind docker run busybox ping -c 1 google.com
ping: bad address 'google.com'
# Tearing down...
$ docker stop dind
dind
$ docker rm dind
dind
$ docker network remove test
test

This effect is the result of the following:

  • Each custom network has a docker-embedded DNS resolver for resolving service names. Connected containers are configured with this resolver in resolv.conf.
    • A custom network is created because the per-build networking mode is enabled via the FF_NETWORK_PER_BUILD feature flag.
    • The resolver is available on 127.0.0.11 for each container connected to the custom network. This cannot be overridden, i.e. passing --dns 192.168.53.53 to the container has no effect on its resolv.conf.
  • For child containers started by dind, the default behaviour of Docker applies for populating resolv.conf, as these are not connected to a custom network.
    • Docker uses the resolv.conf of the "host" (the dind container), stripping away any localhost nameservers (like 127.0.0.11)
    • If no nameservers remain, Docker adds a hard-coded set of default nameservers (8.8.8.8, 8.8.4.4)
    • The resulting list of nameservers is written to the resolv.conf of the child container
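
The resolv.conf handling described in this list can be approximated in a few lines of shell. This is a minimal sketch, not Docker's actual implementation: the function name child_resolv_conf is hypothetical, only the IPv4 fallback servers are modelled, and the loopback match is simplified.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker): read the parent's resolv.conf on
# stdin and print what a child container on the default bridge would receive.
child_resolv_conf() {
    # Drop nameserver entries that point at a loopback address (127.x.x.x or ::1).
    # "|| true" keeps this usable under "set -e" when grep filters out every line.
    filtered=$(grep -Ev '^[[:space:]]*nameserver[[:space:]]+(127\.[0-9.]+|::1)[[:space:]]*$' || true)
    if [ -n "$filtered" ]; then
        printf '%s\n' "$filtered"
    fi
    # If no nameserver survived the filter, append the hard-coded defaults.
    if ! printf '%s\n' "$filtered" | grep -Eq '^[[:space:]]*nameserver[[:space:]]'; then
        printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n'
    fi
}
```

Fed with the DinD container's resolv.conf from the example (search corp.com, nameserver 127.0.0.11, options ndots:0), the function keeps the search and options lines and appends the two default nameservers, which matches the fallback observed in the child container.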

This issue is known but somewhat stalled: moby/moby#20037 (comment).

The workaround is to specify the DNS servers explicitly for child containers running inside of DinD. Either of the following solutions fixes the issue:

  • Configure the DNS on each container which is started within DinD.

    $ docker run -d --name dind --net test --privileged docker:dind
    $ docker exec dind docker run --dns 141.52.3.3 --dns 129.13.64.5 busybox ping -c 1 google.com
  • Configure the default DNS when starting docker:dind dockerd.

    $ docker run -d --name dind --net test --privileged docker:dind --dns 141.52.3.3 --dns 129.13.64.5
    $ docker exec dind docker run busybox ping -c 1 google.com
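
Rather than hard-coding the addresses, the --dns flags for the second variant could be derived from the host's resolv.conf. The following is a sketch under that assumption: dns_flags is a hypothetical helper name, and it presumes the host's resolv.conf lists the resolvers that should actually be used. Loopback entries (for example a local stub resolver) are skipped because they would not be reachable from child containers.

```shell
#!/bin/sh
# Hypothetical helper: read a resolv.conf on stdin and print one
# "--dns <addr>" flag per non-loopback nameserver, separated by spaces.
dns_flags() {
    awk '
        $1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1" {
            printf "%s--dns %s", sep, $2
            sep = " "
        }
        END { print "" }
    '
}

# Example usage (requires Docker, so it is left commented out):
# docker run -d --name dind --net test --privileged docker:dind $(dns_flags < /etc/resolv.conf)
```

The command substitution is deliberately left unquoted in the usage line so that each flag and address is passed to dockerd as a separate argument.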

Actual behavior

Building the Docker image fails because the build containers running inside of DinD cannot resolve DNS names and therefore cannot fetch the required files from the internet. The same applies to any other build step that needs network access, such as fetching or installing packages.
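
To detect this situation before a build fails halfway, a job could inspect the resolv.conf of a child container first. The check below is only a sketch: uses_docker_default_dns is a hypothetical helper, and it treats a resolv.conf whose only nameservers are 8.8.8.8 and 8.8.4.4 as a sign of the fallback, which would be a false positive if those servers were configured on purpose.

```shell
#!/bin/sh
# Hypothetical helper: read a resolv.conf on stdin and succeed (exit 0) if its
# nameservers are exactly Docker's hard-coded IPv4 defaults.
uses_docker_default_dns() {
    # Collect the nameserver addresses, sort them, and join them with commas.
    servers=$(awk '$1 == "nameserver" { print $2 }' | sort | paste -sd, -)
    [ "$servers" = "8.8.4.4,8.8.8.8" ]
}

# Example usage inside a job script (requires Docker, so it is commented out):
# if docker run --rm busybox cat /etc/resolv.conf | uses_docker_default_dns; then
#     echo "WARNING: child containers fell back to Docker's default DNS servers" >&2
# fi
```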

Expected behavior

The image is built successfully because the DNS configuration of the host is used inside the build containers.

Relevant logs and/or screenshots

job log

Log output related to #27686 has been omitted and is marked as OMITTED.

Running with gitlab-runner 13.11.0 (7f7a4bb0)
  on pauls-test-runner-docker-privileged 3gq-aACs
  feature flags: FF_NETWORK_PER_BUILD:true
Preparing the "docker" executor
Using Docker executor with image docker:20.10 ...
Starting service docker:20.10-dind ...
Pulling docker image docker:20.10-dind ...
Using docker image sha256:dc8c389414c80f3c6510d3690cd03c29fc99d66f58955f138248499a34186bfa for docker:20.10-dind with digest docker@sha256:87ed8e3a7b251eef42c2e4251f95ae3c5f8c4c0a64900f19cc532d0a42aa7107 ...
Waiting for services to be up and running...
*** WARNING: Service runner-3gq-aacs-project-25822-concurrent-0-070528995f596ee8-docker-0 probably didn't start properly.
Health check error:
service "runner-3gq-aacs-project-25822-concurrent-0-070528995f596ee8-docker-0-wait-for-service" timeout
Health check container logs:
Service container logs:
OMITTED
*********
Pulling docker image docker:20.10 ...
Using docker image sha256:d2979b152a7d43f040c7aef88c4c83de4e545227622b1045adf6fe409293f803 for docker:20.10 with digest docker@sha256:062edd9c11cbdf94e7620d932857a356fa179eaa26a3cc352759e75f04729c49 ...
Preparing environment
Running on runner-3gq-aacs-project-25822-concurrent-0 via build-ci...
Getting source from Git repository
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/cy8791/dind-dns-test/.git/
Created fresh repository.
Checking out 71bf4985 as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Using docker image sha256:d2979b152a7d43f040c7aef88c4c83de4e545227622b1045adf6fe409293f803 for docker:20.10 with digest docker@sha256:062edd9c11cbdf94e7620d932857a356fa179eaa26a3cc352759e75f04729c49 ...
$ docker build -t myimage .
Step 1/4 : FROM busybox
latest: Pulling from library/busybox
aa2a8d90b84c: Pulling fs layer
aa2a8d90b84c: Verifying Checksum
aa2a8d90b84c: Download complete
aa2a8d90b84c: Pull complete
Digest: sha256:be4684e4004560b2cd1f12148b7120b0ea69c385bcc9b12a637537a2c60f97fb
Status: Downloaded newer image for busybox:latest
 ---> c55b0f125dc6
Step 2/4 : RUN cat /etc/resolv.conf
 ---> Running in 7d3e7642c93f
search corp.com
options ndots:0
nameserver 8.8.8.8
nameserver 8.8.4.4
Removing intermediate container 7d3e7642c93f
 ---> eae2b70b7bcf
Step 3/4 : RUN nslookup google.com || true
 ---> Running in 2303181946a4
;; connection timed out; no servers could be reached
Removing intermediate container 2303181946a4
 ---> 242abe799a60
Step 4/4 : RUN wget -O /google.com.html https://google.com/
 ---> Running in 40ebcfcd7de1
wget: bad address 'google.com'
The command '/bin/sh -c wget -O /google.com.html https://google.com/' returned a non-zero code: 1
Cleaning up file based variables
ERROR: Job failed: exit code 1

Environment description

The custom-installed runner is executed on a host inside a network where outgoing DNS traffic is blocked. That means the DNS servers configured in the host's resolv.conf must be used for performing any DNS query.

The runner uses the Docker executor in privileged mode so that Docker images can be built. Recent versions of GitLab Runner and Docker are installed.

config.toml contents
concurrent = 2
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "REDACTED-docker-privileged"
  url = "https://REDACTED/"
  token = "REDACTED"
  executor = "docker"
  environment = ["DOCKER_DRIVER=overlay2", "DOCKER_TLS_CERTDIR=/certs"]
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.feature_flags]
    FF_NETWORK_PER_BUILD = true
  [runners.docker]
    tls_verify = false
    image = "docker:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    pull_policy = ["always"]
    shm_size = 0
`docker info` output
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc.)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.25.1.el7.x86_64
 Operating System: Red Hat Enterprise Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.696GiB
 Name: build-ci
 ID: YQM6:ZWJI:UQ73:N5GM:K4JL:7PK5:M7CX:GWA4:RYGP:RHUF:O5YX:VPUI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Used GitLab Runner version

Version:      13.11.0
Git revision: 7f7a4bb0
Git branch:   13-11-stable
GO version:   go1.13.8
Built:        2021-04-20T17:02:28+0000
OS/Arch:      linux/amd64

Possible fixes

No straightforward fix is apparent to me.

  1. Ideally, the proper DNS servers would somehow be picked up automatically. This fix would have to occur in Docker/Moby, see moby/moby#20037 (comment).
  2. Alternatively, GitLab Runner could provide a mechanism to specify the DNS servers in config.toml which get picked up by docker:dind containers (and their child containers) running as services within CI jobs.

Fixing this bug is not critical, as workarounds are available, and per-build networking is not the default (yet?).

Workarounds

  1. Require each .gitlab-ci.yml to specify DNS explicitly for the docker:dind service, i.e. specify a command dockerd ... --dns 192.168.53.53. This requires developers to know details of the network environment of the GitLab runners.

  2. Provide a DinD service in the GitLab Runner via config.toml which is properly configured, i.e. has a command dockerd ... --dns 192.168.53.53. As the image is fixed in config.toml, there is no way for developers to specify a different version of the image in .gitlab-ci.yml.

  3. Disable per-build networking for the GitLab Runner, i.e. remove the feature flag from config.toml. Then, passing on the host's nameservers through Docker's resolv.conf mechanism works: host → dind → child container.

    IMHO this is the preferable workaround, as it is simple and preserves the separation between runner administration and developers.
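
For runner administrators, a quick way to see whether workaround 3 applies is to check config.toml for the feature flag. This is a rough sketch: per_build_network_enabled is a hypothetical helper, it only matches the FF_NETWORK_PER_BUILD = true form shown in the config.toml above, and it does not detect the flag being set by other means (for example as a job variable).

```shell
#!/bin/sh
# Hypothetical helper: read a GitLab Runner config.toml on stdin and succeed
# (exit 0) if it contains a line of the form "FF_NETWORK_PER_BUILD = true".
per_build_network_enabled() {
    grep -Eq '^[[:space:]]*FF_NETWORK_PER_BUILD[[:space:]]*=[[:space:]]*true[[:space:]]*(#.*)?$'
}

# Example usage (the path is an assumption; adjust it to the runner's setup):
# if per_build_network_enabled < /etc/gitlab-runner/config.toml; then
#     echo "per-build networking is enabled; DinD child containers may lose the host DNS"
# fi
```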

References

To do

  • Let's test the solution outlined in the comment threads to validate there is a viable solution for this problem.