GCE runner stuck at SSH connection during machine creation, leaving machines up, not running jobs, and costing money
Summary
We have a main runner which triggers a Google Cloud Docker runner. Since the upgrade to 12.2.0, we can't get it working. It can't connect to the machines it creates and reports various errors, including OS not recognized. It doesn't even time out.
Steps to reproduce
Our job is quite simple: running yarn install && yarn test on a simple Node project, using image: node:latest.
Actual behavior
Docker Machine creates the machine but then stays stuck in a waiting step for a long time. The runner doesn't even delete the machine, so it costs us money for nothing.
Please note that we didn't change our configuration.
Fun fact: if a machine that the runner can't connect to gets preempted, the runner detects it and tries to create a new machine, with no more success at running the job! So it always keeps a machine up, doing nothing.
Expected behavior
The machine should be created properly and then accessed by the runner so that it can run jobs, as the previous version did.
Relevant logs and/or screenshots
joblog
Running with gitlab-runner 12.2.0 (a987417a)
on gitlab-runner-gce d078f56e
runner log
(runner-d078f56e-runner-1567080128-c6fe00e9) Check that the project exists driver=google name=runner-d077616e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Check if the instance already exists driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
Creating machine... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Generating SSH Key driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Creating host... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Opening firewall ports driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Creating instance driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Waiting for Instance driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
(runner-d078f56e-runner-1567080128-c6fe00e9) Uploading SSH Key driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
Waiting for machine to be running, this may take a few minutes... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
Detecting operating system of created instance... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
Waiting for SSH to be available... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
Detecting the provisioner... driver=google name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=create
ERROR: Error creating machine: Error detecting OS: OS type not recognized driver=google name=runner-d078f56e-exa-runner-1567080128-c6fe00e9 operation=create
WARNING: Machine creation failed, trying to provision error=exit status 1 name=runner-d078f56e-runner-1567080458-c6fe00e9
Waiting for SSH to be available... name=runner-d078f56e-runner-1567080128-c6fe00e9 operation=provision
We didn't change the docker-machine image in our config, but this error now appears: Error detecting OS: OS type not recognized. We didn't get this error previously.
The machine is provisioned and reachable over SSH from Google Cloud.
Environment description
config.toml contents
concurrent = 1
check_interval = 0

[[runners]]
  name = "gitlab-runner-gce"
  url = "https://our-gitlab.com/"
  token = "SOME_TOKEN"
  executor = "docker+machine"
  limit = 1
  [runners.docker]
    tls_verify = false
    image = "docker:latest"
    privileged = true
    disable_cache = false
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
    shm_size = 0
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "minio:9000"
      AccessKey = "ACCESSKEY"
      SecretKey = "SECRETKEY"
      BucketName = "cache"
      Insecure = true
  [runners.machine]
    IdleCount = 0
    IdleTime = 400
    MachineDriver = "google"
    MachineName = "runner-%s"
    MachineOptions = [
      "google-project=some-project",
      "google-machine-type=custom-4-8192",
      "google-machine-image=coreos-cloud/global/images/family/coreos-stable",
      "google-tags=gitlab-ci-slave",
      "google-preemptible=true",
      "google-zone=us-east1-c",
      "google-use-internal-ip=true",
      "google-disk-type=pd-ssd"
    ]
Used GitLab Runner version
gitlab-runner 12.2.0 (a987417a) - gitlab/gitlab-runner:alpine-v12.2.0
We also tested reverting to 12.1.0, and it fails too.
Workaround
In your docker-machine settings (MachineOptions), replace the google-machine-image entry with one pinned to a specific image:
"google-machine-image=https://www.googleapis.com/compute/v1/projects/coreos-cloud/global/images/coreos-stable-2135-6-0-v20190801"
The new image doesn't seem to have SSH access.
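For context, here is a rough sketch of how the pinned image would sit in the [runners.machine] section of the config.toml shown earlier. It assumes every other option stays exactly as in the original configuration. Only the google-machine-image entry changes, and the image name is the one quoted in the workaround, not a recommendation of a particular release.

```toml
[runners.machine]
  IdleCount = 0
  IdleTime = 400
  MachineDriver = "google"
  MachineName = "runner-%s"
  MachineOptions = [
    "google-project=some-project",
    "google-machine-type=custom-4-8192",
    # Pinned to a specific CoreOS image instead of the coreos-stable family,
    # which previously resolved to an image the runner could not reach over SSH.
    "google-machine-image=https://www.googleapis.com/compute/v1/projects/coreos-cloud/global/images/coreos-stable-2135-6-0-v20190801",
    "google-tags=gitlab-ci-slave",
    "google-preemptible=true",
    "google-zone=us-east1-c",
    "google-use-internal-ip=true",
    "google-disk-type=pd-ssd"
  ]
```

Pinning trades automatic image updates for predictability, so the entry would need to be revisited by hand once a newer image is known to work.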