Runner control host terminated a freshly created EC2 instance before it has a chance to process the assigned job
Summary
EC2 runner instance spawned by docker+machine executor gets terminated before it processes the assigned CI job due to "too many idle machines". It's the first time we've had this issue reported by our users since updating from GitLab Runner v13.9.0.
Steps to reproduce
This is really hard to reproduce since I believe that we're looking at a data race in the runner itself.
Actual behavior
New EC2 instance gets spawned and then terminated two minutes after. The CI job is left hanging for 242 minutes despite 1 hour limit. Merge trains are blocked due to this happening.
Expected behavior
The CI job should be processed by the runner. If the control host needs to remove a machine, an older instance should be removed instead of the newest.
Relevant logs and/or screenshots
control host log
{"driver":"amazonec2","level":"info","msg":"Running pre-create checks...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"Creating machine...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"(RUNNER_INSTANCE_IDENTIFIER) Launching instance...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for machine to be running, this may take a few minutes...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting operating system of created instance...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for SSH to be available...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting the provisioner...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:52Z"}
{"driver":"amazonec2","level":"info","msg":"Provisioning with ubuntu(systemd)...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:52Z"}
{"driver":"amazonec2","level":"info","msg":"Installing Docker...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:01Z"}
{"driver":"amazonec2","level":"info","msg":"Copying certs to the local machine directory...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:02Z"}
{"driver":"amazonec2","level":"info","msg":"Copying certs to the remote machine...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:03Z"}
{"driver":"amazonec2","level":"info","msg":"Setting Docker configuration on the remote daemon...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:05Z"}
{"driver":"amazonec2","level":"info","msg":"Checking connection to Docker...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"driver":"amazonec2","level":"info","msg":"Docker is up and running!","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"driver":"amazonec2","level":"info","msg":"To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"duration":43649364803,"level":"info","msg":"Machine created","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:06:08.805622352Z","retries":0,"time":"2021-09-01T09:06:08Z"}
{"level":"warning","lifetime":43651291774,"msg":"Requesting machine removal","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:06:08.807523452Z","reason":"too many idle machines","time":"2021-09-01T09:06:08Z","used":1899094,"usedCount":0}
{"level":"warning","lifetime":43671699134,"msg":"Stopping machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:06:08Z","used":20344654,"usedCount":0}
{"level":"info","msg":"Stopping \"RUNNER_INSTANCE_IDENTIFIER\"...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"stop","time":"2021-09-01T09:06:08Z"}
{"error":"signal: killed","level":"warning","lifetime":103672615716,"msg":"Error while stopping machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:07:08Z","used":60021261334,"usedCount":0}
{"level":"warning","lifetime":103672673562,"msg":"Removing machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:07:08Z","used":60021318678,"usedCount":0}
{"level":"info","msg":"About to remove RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:08Z"}
{"level":"info","msg":"WARNING: This action will delete both local reference and remote instance.","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:08Z"}
{"level":"info","msg":"Successfully removed RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:09Z"}
{"level":"info","lifetime":103879059631,"msg":"Machine removed","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:07:09.035291554Z","reason":"too many idle machines","retries":0,"time":"2021-09-01T09:07:09Z","used":60227705066,"usedCount":0}
job log
[0KRunning with gitlab-runner 14.0.1 (c1edb478)
[0;m[0K on runner_tag runner_token
[0;msection_start:1630487125:resolve_secrets
[0K[0K[36;1mResolving secrets[0;m
[0;msection_end:1630487125:resolve_secrets
[0Ksection_start:1630487125:prepare_executor
[0K[0K[36;1mPreparing the "docker+machine" executor[0;m
Environment description
Self-hosted GitLab 14.1.4-ee with self-hosted GitLab Runner v14.0.1, running docker-machine executor.
Part of the configuration relating to the affected runner configuration
config.toml contents
[[runners]]
name = "runner-name"
url = "https://gitlab-hostname/"
token = "runner-token"
executor = "docker+machine"
limit = 10
[runners.docker]
image = "alpine:3.14"
privileged = true
disable_cache = false
runtime = "runc"
volumes = [
"/certs/client",
"/cache",
]
shm_size = 2147483648
[runners.cache]
Type = "s3"
Path = "worker/cache"
Shared = true
[runners.cache.s3]
ServerAddress = "s3.amazonaws.com"
BucketName = "bucket-name"
BucketLocation = "eu-west-1"
Insecure = false
[runners.machine]
IdleCount = 0
IdleTime = 0
MaxBuilds = 1
MachineDriver = "amazonec2"
MachineName = "runner_tag-%s"
MachineOptions = [
"engine-storage-driver=overlay2",
"engine-install-url=https://path/to/install/script",
"engine-registry-mirror=https://mymirror", "amazonec2-ami=ami-id",
"amazonec2-keypair-name=keypair",
"amazonec2-iam-instance-profile=instance-profile",
"amazonec2-instance-type=c5.4xlarge",
"amazonec2-region=eu-west-1",
"amazonec2-security-group=security-group",
"amazonec2-subnet-id=subnet-id",
"amazonec2-ssh-user=ssh-user",
"amazonec2-ssh-keypath=/path/to/ssh/key",
"amazonec2-use-private-address=true",
"amazonec2-private-address-only=true",
"amazonec2-vpc-id=vpc",
"amazonec2-zone=a",
"amazonec2-retries=25",
"amazonec2-volume-type=gp2",
"amazonec2-root-size=48",
"amazonec2-userdata=/path/to/userdata",
"amazonec2-tags=tags",
"amazonec2-request-spot-instance=false",
"amazonec2-spot-price=",
]
[[runners.machine.autoscaling]]
Periods = [
"* * 0-5,20-23 * * mon-fri *",
"* * * * * sat,sun *"
]
IdleCount = 0
IdleTime = 0
Timezone = "Europe/London"
Used GitLab Runner version
v14.0.1 with docker-machine executor on AWS
Possible fixes
When running go test -race
on gitlab-runner v14.0.1 codebase, in registry.gitlab.com/gitlab-org/gitlab-runner/ci:1.13.8-13
docker image, I can see the executor/docker/machine
package tests sometimes detecting a data race, but not every go test -race
run produces this output.
Tests were run using the following command make simple-test CGO_ENABLED=1 TESTFLAGS="-race" CI=0
test output
time="2021-09-06T13:03:42Z" level=info msg="Machine created" duration="41.9µs" name=test-machine-1630933422-08508612 now="2021-09-06 13:03:42.2925512 +0000 UTC m=+9.201027301" retries=0
==================
WARNING: DATA RACE
Write at 0x00c000468148 by goroutine 623:
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).createWithGrowthCapacity()
/tmp/code/executors/docker/machine/provider.go:132 +0x742
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).create.func1()
/tmp/code/executors/docker/machine/provider.go:103 +0x6e
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*runnerMachinesCoordinator).waitForGrowthCapacity()
/tmp/code/executors/docker/machine/coordinator.go:44 +0x102
Previous read at 0x00c000468148 by goroutine 476:
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).updateMachine()
/tmp/code/executors/docker/machine/provider.go:367 +0x56
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).updateMachines()
/tmp/code/executors/docker/machine/provider.go:401 +0x228
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).Acquire()
/tmp/code/executors/docker/machine/provider.go:494 +0x179
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.TestMachineIdleLimits()
/tmp/code/executors/docker/machine/provider_test.go:708 +0x30d
testing.tRunner()
/usr/local/go/src/testing/testing.go:909 +0x199
Goroutine 623 (running) created at:
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).create()
/tmp/code/executors/docker/machine/provider.go:102 +0x3a2
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).createMachines()
/tmp/code/executors/docker/machine/provider.go:428 +0x5a
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).Acquire()
/tmp/code/executors/docker/machine/provider.go:497 +0x1ff
gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.TestMachineIdleLimits()
/tmp/code/executors/docker/machine/provider_test.go:702 +0x179
testing.tRunner()
/usr/local/go/src/testing/testing.go:909 +0x199
Goroutine 476 (running) created at:
testing.(*T).Run()
/usr/local/go/src/testing/testing.go:960 +0x651
testing.runTests.func1()
/usr/local/go/src/testing/testing.go:1202 +0xa6
testing.tRunner()
/usr/local/go/src/testing/testing.go:909 +0x199
testing.runTests()
/usr/local/go/src/testing/testing.go:1200 +0x521
testing.(*M).Run()
/usr/local/go/src/testing/testing.go:1117 +0x2ff
main.main()
_testmain.go:108 +0x223