docker+machine executor intermittently fails in "prepare environment" with "Cannot connect to the Docker daemon"

Summary

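Jobs on our docker+machine runner fail during the "prepare environment" stage roughly once a week, because the runner cannot connect to the Docker daemon on the autoscaled VM.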

Steps to reproduce

The failure is sporadic and I have no reliable way to reproduce it. I'm open to suggestions for additional instrumentation or logging that would help track down the issue.

Actual behavior

Occasionally a job fails with:

ERROR: Job failed (system failure): prepare environment: Cannot connect to the Docker daemon at tcp://xx.xx.xx.xx:2376. Is the docker daemon running? (docker.go:705:120s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

The IP address given appears to be of the VM that is started and stopped by the docker+machine runner, as it does not match the IP address of the runner itself.
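One way to narrow this down the next time it happens would be to probe the VM's Docker port from the runner host and record whether the connection is refused, times out, or succeeds. The following is only a minimal sketch: the function names are hypothetical, port 2376 is taken from the error message, and the default of 24 attempts at 5-second intervals is an assumption chosen to cover roughly the 120s shown in the error.

```python
import socket
import time
from datetime import datetime, timezone


def probe_docker_port(host, port=2376, timeout=5.0):
    """Attempt one TCP connection and classify the outcome as a string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is reachable but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: host down, or traffic dropped (e.g. by a firewall rule).
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def watch_docker_port(host, port=2376, attempts=24, delay=5.0):
    """Probe repeatedly; return (UTC timestamp, status) pairs, stopping at the first 'open'."""
    results = []
    for attempt in range(attempts):
        status = probe_docker_port(host, port)
        results.append((datetime.now(timezone.utc).isoformat(), status))
        if status == "open":
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return results
```

A "refused" result would suggest the VM is up but the Docker daemon is not listening yet, while "timeout" would point more towards the instance or the network path. Note that "open" only shows the TCP port is reachable; it does not confirm that the TLS handshake or the Docker API itself is healthy.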

Expected behavior

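Jobs should not fail with a system failure while the environment is being prepared. The runner should be able to connect to the Docker daemon on the VM it started.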

Environment description

This uses a docker+machine runner in AWS (amazonec2 driver, m4.2xlarge instances in us-west-2).

config.toml contents
concurrent = 8
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  limit = 4
  name = "aws-vm-autoscaling-runner"
  url = "https://gitlab.com/"
  token = "xxxxxxx-xxxxxxxxxxxx"
  executor = "docker+machine"
  environment = ["DOCKER_AUTH_CONFIG={\"credHelpers\":{\"xxxx.dkr.ecr.us-west-2.amazonaws.com\":\"ecr-login\"}}"]
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
  [runners.docker]
    tls_verify = false
    image = "docker"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = [
      "/cache",
      "/var/run/docker.sock:/var/run/docker.sock",
      "/home/ubuntu/.docker:/root/.docker",
      "/usr/local/bin/docker-credential-ecr-login:/usr/local/bin/docker-credential-ecr-login"
    ]
    shm_size = 0
  [runners.machine]
    IdleCount = 1
    IdleTime = 2400
    OffPeakPeriods = [
      "* * 0-8,19-23 * * mon-fri *",
      "* * * * * sat,sun *"
    ]
    OffPeakTimezone = "America/Los_Angeles"
    OffPeakIdleCount = 0
    OffPeakIdleTime = 600
    MachineDriver = "amazonec2"
    MachineName = "gitlab-ci-autoscale-%s"
    MachineOptions = [
      "amazonec2-access-key=xxxxx",
      "amazonec2-secret-key=xxxxx",
      "amazonec2-instance-type=m4.2xlarge",
      "amazonec2-region=us-west-2",
      "amazonec2-vpc-id=vpc-xxxxx",
      "amazonec2-iam-instance-profile=GitLabCI",
      "amazonec2-ami=ami-xxxxx",
      "amazonec2-root-size=32",
      "amazonec2-tags=project,ci"
    ]

Docker version:

Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.1
 Git commit:        2d0083d
 Built:             Fri Aug 16 14:20:06 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.1
  Git commit:       2d0083d
  Built:            Wed Aug 14 19:41:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Used GitLab Runner version

Version:      13.1.0
Git revision: 6214287e
Git branch:   13-1-stable
GO version:   go1.13.8
Built:        2020-06-19T21:12:22+0000
OS/Arch:      linux/amd64

We saw the same issue on an older version of GitLab Runner. I updated the runner a few weeks ago to see whether that would resolve the problem; it has not.

Edited by Peter Baughman