Docker-machine on digitalocean does not wait long enough (and leaves a zombie machine)
Summary
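gitlab-runner using docker-machine on DigitalOcean has stopped working: provisioning gives up before the Docker daemon on the new droplet is reachable, and the droplet is left running.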
Steps to reproduce
Set up gitlab-runner on DigitalOcean using an Ubuntu 18 or 20 base image. Start a job and watch the journalctl output of gitlab-runner. Use the most recent GitLab versions of docker-machine and gitlab-runner:
root@gitlab-runner-bastion:~# gitlab-runner --version
Version: 13.6.0
Git revision: 8fa89735
Git branch: 13-6-stable
GO version: go1.13.8
Built: 2020-11-21T06:16:31+0000
OS/Arch: linux/amd64
root@gitlab-runner-bastion:~# docker-machine --version
docker-machine version 0.16.2-gitlab.8, build 38aad0d2
Actual behavior
gitlab-runner uses docker-machine to start a new machine. It creates a droplet, waits for SSH, copies over certificates, restarts Docker on the target, then tries to connect to the target's Docker daemon and fails with:
Dec 10 01:42:45 gitlab-runner-bastion gitlab-runner[24609]: ERROR: Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded driver=digitalocean name=runner-qemdeozz-gitlab-runner-autoscale-1607564407-6a33de96 operation=create
That message is from the older gitlab-runner. The newer gitlab-runner no longer seems to log it, but the surrounding output is otherwise the same:
er-qemdeozz-gitlab-runner-autoscale-1607565422-37d6ce51 operation=create
Dec 10 01:59:24 gitlab-runner-bastion gitlab-runner[25328]: Setting Docker configuration on the remote daemon... driver=digitalocean name=runner-qemdeozz-gitlab-runner-autoscale-1607565422-37d6ce51 operation=create
Dec 10 02:00:07 gitlab-runner-bastion gitlab-runner[25328]: WARNING: Problem while reading command output error=read |0: file already closed
The machine remains up (costing money!) and gitlab-runner tries to start another machine. Docker-machine agrees the target docker isn't up:
$ docker-machine ls
runner-XXXXX-gitlab-runner-autoscale-XXXXX - digitalocean Running tcp://XXX.XXX.XXX.XXX:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
But if we wait one more minute then the target docker does come up!
runner-qemdeozz-gitlab-runner-autoscale-1607564842-fdee4eda - digitalocean Running tcp://104.131.182.250:2376 v20.10.0
And now it just sits there, costing money while gitlab-runner starts up yet another machine it will abandon.
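The state shown above (machine Running, Docker version Unknown) can be detected from a script. The following is a minimal sketch, not a fix: it assumes `docker-machine ls` is invoked with a tab-separated `--format` template using the `Name`, `State` and `DockerVersion` fields, and that an unreachable daemon is reported as `Unknown`, as in the listing above. The function names are hypothetical.

```python
import subprocess

# Tab-separated Go template passed to `docker-machine ls --format`.
LS_FORMAT = "{{.Name}}\t{{.State}}\t{{.DockerVersion}}"


def parse_machines(ls_output):
    """Parse tab-separated `docker-machine ls` output into a list of dicts."""
    machines = []
    for line in ls_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the expected template
        name, state, docker_version = (p.strip() for p in parts)
        machines.append({"name": name, "state": state, "docker": docker_version})
    return machines


def unreachable_machines(machines):
    """Names of machines that are running but whose Docker daemon could not be queried."""
    return [
        m["name"]
        for m in machines
        if m["state"] == "Running" and m["docker"] == "Unknown"
    ]


def list_machines():
    """Run docker-machine and parse its output (requires docker-machine on PATH)."""
    result = subprocess.run(
        ["docker-machine", "ls", "--format", LS_FORMAT],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_machines(result.stdout)
```

Calling `unreachable_machines(list_machines())` on the bastion would list the candidates. Since the daemon does come up about a minute later, a machine flagged here is not necessarily abandoned, so this only narrows down what to inspect.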
Expected behavior
gitlab-runner waits long enough for the Docker daemon on the new droplet to come up, or exposes a parameter to configure how long it waits.
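To illustrate what a configurable wait could look like, here is a minimal sketch that polls a TCP port until a deadline. This is not docker-machine's actual code: the function name, the 120-second default and the 3-second interval are assumptions chosen for illustration, and a plain TCP connect says nothing about whether the TLS setup on port 2376 is healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=3.0):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds.

    `timeout` and `interval` are the hypothetical tunable parameters.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is itself bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable or timed out: retry until the deadline
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))
```

For example, `wait_for_port("XXX.XXX.XXX.XXX", 2376, timeout=180)` would keep retrying for up to three minutes, which would cover the roughly one extra minute observed above.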
Relevant logs and/or screenshots
Inline above
Environment description
A typical Ubuntu host with the latest docker-machine from GitLab plus gitlab-runner. I tried with Ubuntu 18 and 20.
concurrent = 16
check_interval = 10
[session_server]
session_timeout = 1800
[[runners]]
name = "global-gitlab-runner-bastion"
url = "https://gitlab.com/"
token = "..."
executor = "docker+machine"
[runners.custom_build_dir]
[runners.docker]
tls_verify = false
image = "...private..."
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = true
volumes = ["/cache", "/var/lib/docker:/var/lib/docker", "/var/run/docker.sock:/var/run/docker.sock", "/tmp:/tmp"]
shm_size = 0
[runners.cache]
Type = "s3"
Path = "docker-images"
Shared = true
[runners.cache.s3]
ServerAddress = "muse-gitlab-runner.nyc3.digitaloceanspaces.com"
AccessKey = "..."
SecretKey = "..."
BucketName = "muse-gitlab-runner"
BucketLocation = "nyc3"
[runners.machine]
IdleCount = 0
IdleTime = 600
MaxBuilds = 100
MachineDriver = "digitalocean"
MachineName = "gitlab-runner-autoscale-%s"
MachineOptions = ["digitalocean-image=ubuntu-18-04-x64", "digitalocean-ssh-user=root", "digitalocean-access-token=....", "digitalocean-region=nyc3", "digitalocean-size=s-6vcpu-16gb", "digitalocean-private-networking", "digitalocean-tags=runner"]
OffPeakPeriods = ["* * 0-9,18-23 * * mon-fri *", "* * * * * sat,sun *"]
OffPeakTimezone = "America/Denver"
OffPeakIdleCount = 0
Used GitLab Runner version
inline above
Possible fixes
None known.