Skip to content

Remove docker-machine provision on creation failure

Steve Xuereb requested to merge docker-machine-create into master

What does this MR do?

Remove docker-machine provision on creation failure

Why was this MR needed?

In #27613 (closed) (comment 522938477) we see machines failing to create on the cloud provider, but it fails for rate limit reasons or anything else. This leads to a split-brain issue where we have state of a machine "created" in docker-machine state but the machine is nonexistent/inaccessible to the cloud provider.

If the creation fails docker-machine provision runs on a machine that doesn't exist which wastes API requests, and time trying to wait for SSH to be available on a machine. The docker-machine provision tries to re-provision an existing machine, but if creation fails that means that the machine doesn't exist.

Instead of retrying to create/provision the machine with the same details fail fast, remove the machine and try to create a new one with details. This should result in much quicker recovery when we reach the API rate limits on the cloud provider and also job queues because the machine would be available sooner rather than waiting for docker-machine provision to be available.

What's the best way to test this MR?

There is no easy way to force a creation failure mid creation, and it's not always reproducible because it needs to be at the moment where the local state exists and remote state doesn't. Below is an attempt at this.

  1. Start docker+machine runner configured like below to provision VMs on GCP.

    config.toml
    [[runners]]
      name = "docker-machine"
      url = "http://gdk.test:3000/"
      token = ""
      executor = "docker+machine"
      clone_url = "http://192.168.190.160:3000"
      [runners.docker]
        tls_verify = false
        image = "alpine:3.12"
        privileged = false
        disable_entrypoint_overwrite = false
        oom_kill_disable = false
        disable_cache = false
        volumes = ["/cache"]
        shm_size = 0
      [runners.machine]
        IdleCount = 0
        MachineDriver = "google"
        MachineName = "%s-steveazz"
        MachineOptions = ["google-project=group-verify-df9383", "google-disk-size=50", "google-disk-type=pd-ssd", "google-machine-type=n1-standard-1", "engine-registry-mirror=https://mirror.gcr.io"]
  2. Open up the GCP console with the configured project opened.

  3. Start a job, and as soon as you see Waiting for Instance in the runner logs, delete the machine in GCP

    Screenshot_2021-03-15_at_15.25.20

    Checking for jobs... received                       job=2202 repo_url=http://gdk.test:3000/root/playground.git runner=RP-KCb2f
    Running pre-create checks...                        driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Check that the project exists  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Check if the instance already exists  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    Creating machine...                                 driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Generating SSH Key  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Creating host...  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Opening firewall ports  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Creating instance  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    (runner-rp-kcb2f-1615817278-568560df-steveazz) Waiting for Instance  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create #<-------- Delete when you see this log message
  4. You should see the following logs, where it tries to re-provision the machine and waiting for SSH to be available (this is a dead-end because the machine doesn't exist!)

    ERROR: Error creating machine: Error in driver during machine creation: operation error: {RESOURCE_NOT_FOUND  The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-rp-kcb2f-1615817278-568560df-steveazz' was not found [] []}  driver=google name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=create
    WARNING: Machine creation failed, trying to provision  error=exit status 1 name=runner-rp-kcb2f-1615817278-568560df-steveazz
    Waiting for SSH to be available...                  name=runner-rp-kcb2f-1615817278-568560df-steveazz operation=provision
  5. Terminate the runner

  6. Delete local reference of the machine: rm -rf ~/.docker/machine/machines/* (make sure you don't have any other machine that you need)

  7. Start the process again but this time specify the FF_SKIP_DOCKER_MACHINE_PROVISION_ON_CREATION_FAILURE feature flag

    config.toml
    [[runners]]
      name = "docker-machine-gpu"
      url = "https://gitlab.com/"
      token = ""
      executor = "docker+machine"
      [runners.feature_flags]
        "FF_SKIP_DOCKER_MACHINE_PROVISION_ON_CREATION_FAILURE" = true #<----- Enable feature flag
      [runners.docker]
        image = "ubuntu"
        privileged = false
        volumes = ["/cache"]
      [runners.machine]
        IdleCount = 0
        MachineDriver = "google"
        MachineName = "steveazz-%s"
        MachineOptions = [
          "google-project=group-verify-df9383",
          "google-disk-size=50",
          "google-disk-type=pd-ssd",
          "google-machine-type=n1-standard-1",
          "engine-registry-mirror=https://mirror.gcr.io",
          "engine-opt=mtu=1460",
          "engine-opt=ipv6",
          "engine-opt=fixed-cidr-v6=fc00::/7",
        ]
    runner system log analysis
    # Got job
    Checking for jobs... received                       job=1101059904 repo_url=https://gitlab.com/steveazz/playground.git runner=g_dsjkxZ
    
    # Creating first machine `runner-gdsjkxz-steveazz-1615878835-1abd984b`
    Running pre-create checks...                        driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Check that the project exists  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Check if the instance already exists  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    Creating machine...                                 driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Generating SSH Key  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Creating host...  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Opening firewall ports  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Creating instance  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Waiting for Instance  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    
    # Machine deleted by hand in GCP console right now
    ERROR: Error creating machine: Error in driver during machine creation: operation error: {RESOURCE_NOT_FOUND  The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-gdsjkxz-steveazz-1615878835-1abd984b' was not found [] []}  driver=google name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=create
    
    # We can see the log from https://gitlab.com/gitlab-org/gitlab-runner/-/blob/4d60bcaab71374c026252605bfc5a9ecbd993294/executors/docker/machine/provider.go#L119
    Skipping provision retry on failed machine          error=exit status 1 name=runner-gdsjkxz-steveazz-1615878835-1abd984b
    ERROR: Machine creation failed                      error=exit status 1 name=runner-gdsjkxz-steveazz-1615878835-1abd984b time=3m15.250528188s
    WARNING: Requesting machine removal                 lifetime=3m15.250572207s name=runner-gdsjkxz-steveazz-1615878835-1abd984b now=2021-03-16 08:17:10.556802 +0100 CET m=+261.255227168 reason=Failed to create used=3m15.250572444s usedCount=0
    WARNING: Stopping machine                           lifetime=3m24.401840107s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=47.110277ms usedCount=0
    Stopping "runner-gdsjkxz-steveazz-1615878835-1abd984b"...  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=stop
    Running pre-create checks...                        driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Check that the project exists  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    ERROR: googleapi: Error 404: The resource 'projects/group-verify-df9383/zones/us-central1-a/instances/runner-gdsjkxz-steveazz-1615878835-1abd984b' was not found, notFound  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=stop
    WARNING: Error while stopping machine               error=exit status 1 lifetime=3m26.05819186s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=1.703461871s usedCount=0
    WARNING: Removing machine                           lifetime=3m26.058247639s name=runner-gdsjkxz-steveazz-1615878835-1abd984b reason=Failed to create used=1.70351632s usedCount=0
    About to remove runner-gdsjkxz-steveazz-1615878835-1abd984b  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove
    WARNING: This action will delete both local reference and remote instance.  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Deleting instance.  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Check if the instance already exists  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    
    # New machine creating instead of running `docker-machine provision`
    # that would have never succesdded
    Creating machine...                                 driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Generating SSH Key  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Creating host...  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Opening firewall ports  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Creating instance  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615878835-1abd984b) Remote instance does not exist, proceeding with removing local reference  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove
    Successfully removed runner-gdsjkxz-steveazz-1615878835-1abd984b  name=runner-gdsjkxz-steveazz-1615878835-1abd984b operation=remove
    Machine removed                                     lifetime=3m27.370799498s name=runner-gdsjkxz-steveazz-1615878835-1abd984b now=2021-03-16 08:17:22.676749 +0100 CET m=+273.375472400 reason=Failed to create retries=0 used=3.016069639s usedCount=0
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Waiting for Instance  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    (runner-gdsjkxz-steveazz-1615879040-fb8f7eeb) Uploading SSH Key  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Waiting for machine to be running, this may take a few minutes...  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Detecting operating system of created instance...   driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Waiting for SSH to be available...                  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Detecting the provisioner...                        driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Provisioning with ubuntu(systemd)...                driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Installing Docker...                                driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Copying certs to the local machine directory...     driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Copying certs to the remote machine...              driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Setting Docker configuration on the remote daemon...  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Checking connection to Docker...                    driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Docker is up and running!                           driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: docker-machine env runner-gdsjkxz-steveazz-1615879040-fb8f7eeb  driver=google name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb operation=create
    Skipping provision retry on failed machine          error=<nil> name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb
    Machine created                                     duration=2m16.766447559s name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:19:37.427245 +0100 CET m=+408.129277143 retries=0
    Created docker-machine                              created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:19:38.198272 +0100 CET m=+408.900322644 project=10644153 runner=g_dsjkxZ usedcount=1
    Starting docker-machine build...                    created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:14.712678 +0100 CET m=+445.415625724 project=10644153 runner=g_dsjkxZ usedcount=1
    Finished docker-machine build: <nil>                created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:29.597103 +0100 CET m=+460.300416482 project=10644153 runner=g_dsjkxZ usedcount=1
    Cleaned up docker-machine                           created=2021-03-16 08:17:20.664122 +0100 CET m=+271.362795889 docker=tcp://34.70.78.2:2376 job=1101059904 name=runner-gdsjkxz-steveazz-1615879040-fb8f7eeb now=2021-03-16 08:20:30.086969 +0100 CET m=+460.790294736 project=10644153 runner=g_dsjkxZ usedcount=1
    Job succeeded                                       duration=6m34.78749474s job=1101059904 project=10644153 runner=g_dsjkxZ

What are the relevant issue numbers?

closes #27613 (closed)

Edited by Steve Xuereb

Merge request reports