AWS EC2 autoscaling does not work due to a Docker installation issue

Summary

Today our GitLab Runner with AWS EC2 autoscaling (docker-machine) suddenly stopped working, and I am still trying to find out why it fails.

I have already tried:

  • Updating all components to the latest version
  • Recreating the docker-machine certs
  • Rebooting the system

None of this solved the issue. The Docker machines still fail to start: the instances do launch on AWS, and I can connect to them with docker-machine ssh <name> while they boot, but they are killed shortly afterwards. Before that happens, docker-machine ls shows "Unable to query docker version: Cannot connect to the docker engine endpoint" for them.
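To check independently whether anything is listening on a machine's Docker engine port while the instance is still up, a plain TCP probe from the runner manager can help. The Python sketch below assumes docker-machine's default engine port 2376 and that the instance's private address is reachable from the manager (the config sets amazonec2-use-private-address=true); the script and function names are illustrative, not part of any tool.

```python
import socket
import sys

# Assumption: docker-machine exposes the Docker engine on TCP port 2376 (its default).
DEFAULT_ENGINE_PORT = 2376


def engine_port_open(host: str, port: int = DEFAULT_ENGINE_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable/unresolvable hosts.
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_engine.py <private-ip-of-the-machine>
    target = sys.argv[1]
    state = "reachable" if engine_port_open(target) else "NOT reachable"
    print(f"Docker engine port {DEFAULT_ENGINE_PORT} on {target} is {state}")
```

A refused connection means the host is up but nothing listens on the port; a timeout points more towards a security group or routing problem.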

Versions:

  • Docker: Docker version 20.10.14, build a224086
  • Docker Machine: docker-machine version 0.16.2-gitlab.13, build dac65f58
  • GitLab Runner: 14.10.0 c6bb62f6

Error log with debug enabled

Apr 26 17:52:13 ip-172-31-39-74 gitlab-runner[19337]: Requeued the runner                                 builds=1 runner=rtB9T4xx
Apr 26 17:52:13 ip-172-31-39-74 gitlab-runner[19337]: Running with gitlab-runner 14.10.0 (c6bb62f6)       job=2379908834 project=3139350 runner=rtB9T4xx
Apr 26 17:52:13 ip-172-31-39-74 gitlab-runner[19337]:   on autoscaling-mycompany rtB9T4xx                  job=2379908834 project=3139350 runner=rtB9T4xx
Apr 26 17:52:13 ip-172-31-39-74 gitlab-runner[19337]: Preparing the "docker+machine" executor  job=2379908834 project=3139350 runner=rtB9T4xx
Apr 26 17:52:13 ip-172-31-39-74 gitlab-runner[19337]: Executing /usr/local/bin/docker-machine [docker-machine --bugsnag-api-token=no-report create --driver amazonec2 --amazonec2-request-spot-instance=false --amazonec2-access-key=<redacted> --amazonec2-secret-key=<redacted> --amazonec2-region=eu-central-1 --amazonec2-vpc-id=vpc-<redacted> --amazonec2-subnet-id=subnet-<redacted> --amazonec2-zone=b --amazonec2-use-private-address=true --amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true --amazonec2-security-group=gitlab-aws-runner --amazonec2-instance-type=t3.large --amazonec2-root-size=80 --engine-registry-mirror=http://172.31.39.74:6000 runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f]
Apr 26 17:52:14 ip-172-31-39-74 gitlab-runner[19337]: Running pre-create checks...                        driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:14 ip-172-31-39-74 gitlab-runner[19337]: Creating machine...                                 driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:14 ip-172-31-39-74 gitlab-runner[19337]: (runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f) Launching instance...  driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:14 ip-172-31-39-74 gitlab-runner[19337]: Checking for jobs... nothing                        runner=rtB9T4xx
[...]
Apr 26 17:52:16 ip-172-31-39-74 gitlab-runner[19337]: Dialing: tcp gitlab.com:443 ...
Apr 26 17:52:17 ip-172-31-39-74 gitlab-runner[19337]: Appending trace to coordinator... ok                code=202 job=2379908834 job-log=0-206 job-status=running runner=rtB9T4xx sent-log=0-205 status=202 Accepted update-interval=3s
[...]
Apr 26 17:52:22 ip-172-31-39-74 gitlab-runner[19337]: Waiting for machine to be running, this may take a few minutes...  driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:22 ip-172-31-39-74 gitlab-runner[19337]: Detecting operating system of created instance...   driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:22 ip-172-31-39-74 gitlab-runner[19337]: Waiting for SSH to be available...                  driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
[...]
Apr 26 17:52:41 ip-172-31-39-74 gitlab-runner[19337]: Detecting the provisioner...                        driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:52:41 ip-172-31-39-74 gitlab-runner[19337]: Provisioning with ubuntu(systemd)...                driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
[...]
Apr 26 17:52:54 ip-172-31-39-74 gitlab-runner[19337]: Installing Docker...                                driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
[...]
Apr 26 17:53:15 ip-172-31-39-74 gitlab-runner[19337]: ERROR: Error creating machine: Error running provisioning: error installing docker:   driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=create
Apr 26 17:53:15 ip-172-31-39-74 gitlab-runner[19337]: ERROR: Machine creation failed                      error=exit status 1 name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f time=1m1.861464233s
Apr 26 17:53:15 ip-172-31-39-74 gitlab-runner[19337]: WARNING: Requesting machine removal                 lifetime=1m1.861688723s name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f now=2022-04-26 17:53:15.786350131 +0000 UTC m=+174.028766006 reason=Failed to create used=1m1.861689155s usedCount=0
Apr 26 17:53:15 ip-172-31-39-74 gitlab-runner[19337]: WARNING: Stopping machine                           lifetime=1m1.934200501s name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f reason=Failed to create used=72.474199ms usedCount=0
Apr 26 17:53:15 ip-172-31-39-74 gitlab-runner[19337]: Stopping "runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f"...  name=runner-rtb9t4xx-gitlab-docker-machine-1650995533-89265a3f operation=stop
[...]
Apr 26 17:53:16 ip-172-31-39-74 gitlab-runner[19337]: Executing /usr/local/bin/docker-machine [docker-machine --bugsnag-api-token=no-report create --driver amazonec2 --amazonec2-request-spot-instance=false --amazonec2-access-key=xxx --amazonec2-secret-key=xxx --amazonec2-region=eu-central-1 --amazonec2-vpc-id=vpc-0ff1a867 --amazonec2-subnet-id=subnet-88230ef2 --amazonec2-zone=b --amazonec2-use-private-address=true --amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true --amazonec2-security-group=gitlab-aws-runner --amazonec2-instance-type=t3.large --amazonec2-root-size=80 --engine-registry-mirror=http://172.31.39.74:6000 runner-rtb9t4xx-gitlab-docker-machine-1650995596-ebf056f1]
Apr 26 17:53:16 ip-172-31-39-74 gitlab-runner[19337]: Running pre-create checks...                        driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995596-ebf056f1 operation=create
Apr 26 17:53:16 ip-172-31-39-74 gitlab-runner[19337]: Creating machine...                                 driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995596-ebf056f1 operation=create
Apr 26 17:53:17 ip-172-31-39-74 gitlab-runner[19337]: (runner-rtb9t4xx-gitlab-docker-machine-1650995596-ebf056f1) Launching instance...  driver=amazonec2 name=runner-rtb9t4xx-gitlab-docker-machine-1650995596-ebf056f1 operation=create
Apr 26 17:53:17 ip-172-31-39-74 gitlab-runner[19337]: Submitting job to coordinator... ok                 code=200 job=2379908834 job-status= runner=rtB9T4xx update-interval=0s
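The journald lines above share one layout: syslog timestamp, host, gitlab-runner[pid], a message, then (after a run of spaces) key=value fields whose values may themselves contain spaces, such as error=exit status 1. The Python sketch below splits lines of that shape so failed machine creations can be pulled out of a longer log, for example one exported with journalctl -u gitlab-runner (assuming the service unit is named gitlab-runner). The layout is inferred from this excerpt only, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Layout inferred from the excerpt:
#   "<Mon> <day> <HH:MM:SS> <host> gitlab-runner[<pid>]: <message>  key=value key=value ..."
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) ([\w.-]+)\[(\d+)\]: (.*)$")
# A key starts at the beginning of the field section or right after whitespace.
KEY_RE = re.compile(r"(?:^|(?<=\s))([A-Za-z][\w.-]*)=")


def parse_fields(section: str) -> Dict[str, str]:
    """Split 'a=1 b=two words c=3' into a dict; a value runs until the next key."""
    matches = list(KEY_RE.finditer(section))
    fields: Dict[str, str] = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(section)
        fields[m.group(1)] = section[m.end():end].strip()
    return fields


def parse_line(line: str) -> Optional[Tuple[str, str, Dict[str, str]]]:
    """Return (timestamp, message, fields), or None for lines like '[...]'."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    timestamp, body = m.group(1), m.group(5).strip()
    # The message is separated from the key=value section by two or more spaces.
    parts = re.split(r"\s{2,}", body, maxsplit=1)
    fields = parse_fields(parts[1]) if len(parts) > 1 else {}
    return timestamp, parts[0], fields


def failed_machines(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], Optional[str]]]:
    """Yield (machine name, error) for every 'Machine creation failed' entry."""
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[1] == "ERROR: Machine creation failed":
            yield parsed[2].get("name"), parsed[2].get("error")
```

Lines without a double-space separator, such as the long Executing line, are returned with an empty field dict, so the '=' signs inside its flag list are not misread as fields.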

config.toml

log_level = "debug"
concurrent = 15
check_interval = 0

listen_address = "172.31.39.74:9252"

[session_server]
  listen_address = "[::]:8093" #  listen on all available interfaces on port 8093
  advertise_address = "xxx"
  session_timeout = 3600

[[runners]]
  name = "autoscaling-mycompany"
  limit = 15
  url = "https://gitlab.com/"
  token = "xxx"
  executor = "docker+machine"
  [runners.docker]
    tls_verify = false
    image = "alpine"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = true
    shm_size = 0
    wait_for_services_timeout = 120
  [runners.cache]
    Type = "s3"
    Path = "cache"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      AccessKey = "xxx"
      SecretKey = "xxx"
      BucketName = "xxx"
      BucketLocation = "eu-central-1"
    [runners.cache.gcs]
  [runners.machine]
    IdleCount = 0
    IdleTime = 900
    MaxBuilds = 15
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
        "amazonec2-request-spot-instance=false",
        "amazonec2-access-key=xxx",
        "amazonec2-secret-key=xxx",
        "amazonec2-region=eu-central-1",
        "amazonec2-vpc-id=vpc-xxx",
        "amazonec2-subnet-id=subnet-xxx",
        "amazonec2-zone=b",
        "amazonec2-use-private-address=true",
        "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
        "amazonec2-security-group=gitlab-aws-runner",
        "amazonec2-instance-type=t3.large",
        "amazonec2-root-size=80",
        "engine-registry-mirror=http://172.31.39.74:6000"
    ]
    OffPeakTimezone = ""
    OffPeakIdleCount = 0
    OffPeakIdleTime = 0
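Comparing this config with the Executing line in the log, each MachineOptions entry appears to be passed to docker-machine create as a --flag, followed by a generated machine name. The Python sketch below rebuilds that command from the config values for cross-checking. The name pattern (runner-<runner ID>-<MachineName with %s replaced by "<Unix timestamp>-<random suffix>">) is inferred from the two names in the log, not taken from GitLab Runner's source, and all function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List

# Values copied from config.toml above; secrets and IDs stay redacted ("xxx").
MACHINE_DRIVER = "amazonec2"
MACHINE_NAME = "gitlab-docker-machine-%s"
MACHINE_OPTIONS = [
    "amazonec2-request-spot-instance=false",
    "amazonec2-access-key=xxx",
    "amazonec2-secret-key=xxx",
    "amazonec2-region=eu-central-1",
    "amazonec2-vpc-id=vpc-xxx",
    "amazonec2-subnet-id=subnet-xxx",
    "amazonec2-zone=b",
    "amazonec2-use-private-address=true",
    "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
    "amazonec2-security-group=gitlab-aws-runner",
    "amazonec2-instance-type=t3.large",
    "amazonec2-root-size=80",
    "engine-registry-mirror=http://172.31.39.74:6000",
]


def machine_name(runner_id: str, template: str, unix_ts: int, suffix: str) -> str:
    """Hypothetical helper reproducing the naming pattern seen in the log (inferred)."""
    return f"runner-{runner_id.lower()}-" + template % f"{unix_ts}-{suffix}"


def created_at(name: str) -> datetime:
    """Decode the Unix timestamp stored as the second-to-last '-' separated part of a name."""
    return datetime.fromtimestamp(int(name.rsplit("-", 2)[-2]), tz=timezone.utc)


def create_argv(driver: str, options: List[str], name: str) -> List[str]:
    """Rebuild the argument list shown in the 'Executing ... docker-machine' log line."""
    return (
        ["docker-machine", "--bugsnag-api-token=no-report", "create", "--driver", driver]
        + ["--" + opt for opt in options]  # each MachineOptions entry becomes a --flag
        + [name]
    )
```

For example, machine_name("rtB9T4xx", MACHINE_NAME, 1650995533, "89265a3f") yields the first machine name from the log, and created_at() on that name gives 2022-04-26 17:52:13 UTC, the same second as the first Executing line.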

Thanks in advance

Edited by Max Fehmerling