Skip to content

Runner control host terminated a freshly created EC2 instance before it has a chance to process the assigned job

Summary

EC2 runner instance spawned by docker+machine executor gets terminated before it processes the assigned CI job due to "too many idle machines". It's the first time we've had this issue reported by our users since updating from GitLab Runner v13.9.0.

Steps to reproduce

This is really hard to reproduce since I believe that we're looking at a data race in the runner itself.

Actual behavior

New EC2 instance gets spawned and then terminated two minutes after. The CI job is left hanging for 242 minutes despite 1 hour limit. Merge trains are blocked due to this happening.

image

Expected behavior

The CI job should be processed by the runner. If the control host needs to remove a machine, an older instance should be removed instead of the newest.

Relevant logs and/or screenshots

control host log
{"driver":"amazonec2","level":"info","msg":"Running pre-create checks...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"Creating machine...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"(RUNNER_INSTANCE_IDENTIFIER) Launching instance...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:25Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for machine to be running, this may take a few minutes...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting operating system of created instance...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Waiting for SSH to be available...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:36Z"}
{"driver":"amazonec2","level":"info","msg":"Detecting the provisioner...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:52Z"}
{"driver":"amazonec2","level":"info","msg":"Provisioning with ubuntu(systemd)...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:05:52Z"}
{"driver":"amazonec2","level":"info","msg":"Installing Docker...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:01Z"}
{"driver":"amazonec2","level":"info","msg":"Copying certs to the local machine directory...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:02Z"}
{"driver":"amazonec2","level":"info","msg":"Copying certs to the remote machine...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:03Z"}
{"driver":"amazonec2","level":"info","msg":"Setting Docker configuration on the remote daemon...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:05Z"}
{"driver":"amazonec2","level":"info","msg":"Checking connection to Docker...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"driver":"amazonec2","level":"info","msg":"Docker is up and running!","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"driver":"amazonec2","level":"info","msg":"To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"create","time":"2021-09-01T09:06:08Z"}
{"duration":43649364803,"level":"info","msg":"Machine created","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:06:08.805622352Z","retries":0,"time":"2021-09-01T09:06:08Z"}
{"level":"warning","lifetime":43651291774,"msg":"Requesting machine removal","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:06:08.807523452Z","reason":"too many idle machines","time":"2021-09-01T09:06:08Z","used":1899094,"usedCount":0}
{"level":"warning","lifetime":43671699134,"msg":"Stopping machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:06:08Z","used":20344654,"usedCount":0}
{"level":"info","msg":"Stopping \"RUNNER_INSTANCE_IDENTIFIER\"...","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"stop","time":"2021-09-01T09:06:08Z"}
{"error":"signal: killed","level":"warning","lifetime":103672615716,"msg":"Error while stopping machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:07:08Z","used":60021261334,"usedCount":0}
{"level":"warning","lifetime":103672673562,"msg":"Removing machine","name":"RUNNER_INSTANCE_IDENTIFIER","reason":"too many idle machines","time":"2021-09-01T09:07:08Z","used":60021318678,"usedCount":0}
{"level":"info","msg":"About to remove RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:08Z"}
{"level":"info","msg":"WARNING: This action will delete both local reference and remote instance.","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:08Z"}
{"level":"info","msg":"Successfully removed RUNNER_INSTANCE_IDENTIFIER","name":"RUNNER_INSTANCE_IDENTIFIER","operation":"remove","time":"2021-09-01T09:07:09Z"}
{"level":"info","lifetime":103879059631,"msg":"Machine removed","name":"RUNNER_INSTANCE_IDENTIFIER","now":"2021-09-01T09:07:09.035291554Z","reason":"too many idle machines","retries":0,"time":"2021-09-01T09:07:09Z","used":60227705066,"usedCount":0}
cloudtrail log image
job log
[0KRunning with gitlab-runner 14.0.1 (c1edb478)
[0;m[0K  on runner_tag runner_token
[0;msection_start:1630487125:resolve_secrets
[0K[0K[36;1mResolving secrets[0;m
[0;msection_end:1630487125:resolve_secrets
[0Ksection_start:1630487125:prepare_executor
[0K[0K[36;1mPreparing the "docker+machine" executor[0;m

Environment description

Self-hosted GitLab 14.1.4-ee with self-hosted GitLab Runner v14.0.1, running docker-machine executor.

Part of the configuration relating to the affected runner configuration

config.toml contents
[[runners]]
  name = "runner-name"
  url = "https://gitlab-hostname/"
  token = "runner-token"
  executor = "docker+machine"
  limit = 10
  [runners.docker]
    image = "alpine:3.14"
    privileged = true
    disable_cache = false
    runtime = "runc"
    volumes = [
            "/certs/client",
            "/cache",
          ]
    shm_size = 2147483648
  [runners.cache]
    Type = "s3"
    Path = "worker/cache"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "bucket-name"
      BucketLocation = "eu-west-1"
      Insecure = false
  [runners.machine]
    IdleCount = 0
    IdleTime = 0
    MaxBuilds = 1
    MachineDriver = "amazonec2"
    MachineName = "runner_tag-%s"
    MachineOptions = [
      "engine-storage-driver=overlay2",
      "engine-install-url=https://path/to/install/script",
      "engine-registry-mirror=https://mymirror",      "amazonec2-ami=ami-id",
      "amazonec2-keypair-name=keypair",
      "amazonec2-iam-instance-profile=instance-profile",
      "amazonec2-instance-type=c5.4xlarge",
      "amazonec2-region=eu-west-1",
      "amazonec2-security-group=security-group",
      "amazonec2-subnet-id=subnet-id",
      "amazonec2-ssh-user=ssh-user",
      "amazonec2-ssh-keypath=/path/to/ssh/key",
      "amazonec2-use-private-address=true",
      "amazonec2-private-address-only=true",
      "amazonec2-vpc-id=vpc",
      "amazonec2-zone=a",
      "amazonec2-retries=25",
      "amazonec2-volume-type=gp2",
      "amazonec2-root-size=48",
      "amazonec2-userdata=/path/to/userdata",
      "amazonec2-tags=tags",
      "amazonec2-request-spot-instance=false",
      "amazonec2-spot-price=",
    ]
    [[runners.machine.autoscaling]]
      Periods = [
        "* * 0-5,20-23 * * mon-fri *",
        "* * * * * sat,sun *"
      ]
      IdleCount = 0
      IdleTime = 0
      Timezone = "Europe/London"

Used GitLab Runner version

v14.0.1 with docker-machine executor on AWS

Possible fixes

When running go test -race on gitlab-runner v14.0.1 codebase, in registry.gitlab.com/gitlab-org/gitlab-runner/ci:1.13.8-13 docker image, I can see the executor/docker/machine package tests sometimes detecting a data race, but not every go test -race run produces this output.

Tests were run using the following command make simple-test CGO_ENABLED=1 TESTFLAGS="-race" CI=0

test output
time="2021-09-06T13:03:42Z" level=info msg="Machine created" duration="41.9µs" name=test-machine-1630933422-08508612 now="2021-09-06 13:03:42.2925512 +0000 UTC m=+9.201027301" retries=0
==================
WARNING: DATA RACE
Write at 0x00c000468148 by goroutine 623:
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).createWithGrowthCapacity()
      /tmp/code/executors/docker/machine/provider.go:132 +0x742
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).create.func1()
      /tmp/code/executors/docker/machine/provider.go:103 +0x6e
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*runnerMachinesCoordinator).waitForGrowthCapacity()
      /tmp/code/executors/docker/machine/coordinator.go:44 +0x102

Previous read at 0x00c000468148 by goroutine 476:
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).updateMachine()
      /tmp/code/executors/docker/machine/provider.go:367 +0x56
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).updateMachines()
      /tmp/code/executors/docker/machine/provider.go:401 +0x228
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).Acquire()
      /tmp/code/executors/docker/machine/provider.go:494 +0x179
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.TestMachineIdleLimits()
      /tmp/code/executors/docker/machine/provider_test.go:708 +0x30d
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:909 +0x199

Goroutine 623 (running) created at:
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).create()
      /tmp/code/executors/docker/machine/provider.go:102 +0x3a2
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).createMachines()
      /tmp/code/executors/docker/machine/provider.go:428 +0x5a
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.(*machineProvider).Acquire()
      /tmp/code/executors/docker/machine/provider.go:497 +0x1ff
  gitlab.com/gitlab-org/gitlab-runner/executors/docker/machine.TestMachineIdleLimits()
      /tmp/code/executors/docker/machine/provider_test.go:702 +0x179
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:909 +0x199

Goroutine 476 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:960 +0x651
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1202 +0xa6
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:909 +0x199
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1200 +0x521
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:1117 +0x2ff
  main.main()
      _testmain.go:108 +0x223