Stuck at "Dialing instance ..." using GitLab Runner with docker-autoscaler in Azure

Summary

After trying a lot of things just to get my runner to show up, authenticate, and create VMs in my scale set in Azure, I finally got a connection working and the job is picked up.

However, I'm now stuck at "Dialing instance x ..." forever. I first tried it with my custom Windows Server image, but to rule out Windows as the cause I'm now using an Ubuntu image in the scale set.

I don't know how to investigate this further, what additional information I could provide, or what else I could try. I'd appreciate any help.

Steps to reproduce

  • Create a VM in Azure based on Ubuntu Server 22.04. Log into it, install Docker, log out, deprovision it, and capture it into an image.
  • Create a Virtual Machine Scale Set in Azure using the image and use uniform orchestration and manual scaling.
    • Also add a cloud-init script to the scale set that during provisioning adds the user to the docker group.
  • Create a second VM in Azure based on Ubuntu Server 22.04. Log in as root, install gitlab-runner and the Azure fleeting plugin, and register the runner.
  • Edit the config.toml as shown under "config.toml contents" below.
  • Start the runner.
  • Start a job that will use the runner.

Actual behavior

The job is stuck at "Dialing instance x ..."

Expected behavior

Either the job goes through or I get a message telling me why it can't go through.

Relevant logs and/or screenshots

Job Log
Running with gitlab-runner 17.1.0 (fe451d5a)
  on ubuntuGitlab qxwMYYfQR, system ID: s_699d82a7c6ed
Resolving secrets
Preparing the "docker-autoscaler" executor
Dialing instance 8...

After running into the timeout it tries again:

ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 0 times: dial tcp 10.3.0.4:22: i/o timeout
Will be retried in 3s ...
Dialing instance 8...
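
The error above is a TCP timeout on port 22 of the instance's internal address. As a minimal sketch for checking the same connection outside of the runner, the following could be run on the runner VM. `probe_tcp` is a hypothetical helper name, not part of GitLab Runner or the fleeting plugin, and the 5 second timeout is an arbitrary assumption.

```python
import socket


def probe_tcp(host, port=22, timeout=5.0):
    """Try to open a TCP connection and report how the attempt ended.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout, e.g. packets dropped or host unreachable.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return f"error: {exc}"


# Example call, using the address from the job log (assumed to be the instance's IP):
# print(probe_tcp("10.3.0.4"))
```

A "timeout" result would match what the runner reports, which would suggest that traffic to port 22 is not reaching the instance at all, whereas "refused" would suggest the instance is reachable but no SSH server is listening.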

GitLab Runner Debug Log

Using gitlab-runner --debug run I get this:
Checking for jobs... received                       job=7138086521 repo_url=<> runner=<>
Processing chain                                    chain-leaf=[0xc000119b80 0xc00011a680 0xc000be3700] context=certificate-chain-build resolve-full-chain=false
Added job to processing list                        builds=1 job=7138086521 max_builds=1 project=<> repo_url=<> time_in_queue_seconds=12
Failed to requeue the runner                        builds=1 max_builds=1 runner=<>
Running with gitlab-runner 17.1.0 (fe451d5a)        job=7138086521 project=<> runner=<>
  on ubuntuGitlab <>, system ID:   job=7138086521 project=<> runner=<>
Resolving secrets                       job=7138086521 project=<> runner=<>
Preparing the "docker-autoscaler" executor  job=7138086521 project=<> runner=<>
Preparing instance...                               job=7138086521 project=<> runner=<>
Dialing instance                                    external-address= instance-id=6 internal-address=10.3.0.5 job=7138086521 project=<> runner=<> use-external-address=true
Dialing instance 6...                               job=7138086521 project=<> runner=<>
Feeding runners to channel                          builds=1 max_builds=1
Feeding runner to channel                           builds=1 max_builds=1 runner=<>
increasing instances response                       group=azure/<>/ubuntuScaleSetTest num_requested=1 num_successful=1 runner=<> subsystem=taskscaler
increase update                                     group=azure/<>/ubuntuScaleSetTest pending=1 requesting=0 runner=<> subsystem=taskscaler total_pending=1
running removal()                                   group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile()                                 group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
instance discovery                                  cause=requested group=azure/<>/ubuntuScaleSetTest id=7 runner=<> state=creating subsystem=taskscaler
running provision()                                 group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Appending trace to coordinator...ok                 code=202 job=7138086521 job-log=0-298 job-status=running runner=<> sent-log=0-297 status=202 Accepted update-interval=1m0s
running removal()                                   group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile()                                 group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision()                                 group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running removal()                                   group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile()                                 group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
......
......
instance update                                     group=azure/<>/ubuntuScaleSetTest id=7 runner=<> state=running subsystem=taskscaler
running provision()                                 group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
ready                                               instance=7 runner=<> subsystem=taskscaler took=83.935382ms
Updating job...                                     bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok                  bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
running removal()                                   group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile()                                 group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision()                                 group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Updating job...                                     bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok                  bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
running removal()                                   group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile()                                 group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision()                                 group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Updating job...                                     bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok                  bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
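
The "Dialing instance" line in the debug log above is the one that carries the addressing details. As a small sketch for pulling those fields out of a long log line, the following splits the key=value pairs into a dict. `parse_fields` is a hypothetical helper, and it assumes values contain no spaces, which holds for this line but not for every line in the log.

```python
def parse_fields(line):
    """Collect key=value tokens from one runner debug log line into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            # Tokens without "=" (the message text itself) are skipped.
            fields[key] = value
    return fields


# The "Dialing instance" line, copied from the debug log above.
line = (
    "Dialing instance                                    external-address= "
    "instance-id=6 internal-address=10.3.0.5 job=7138086521 project=<> "
    "runner=<> use-external-address=true"
)

fields = parse_fields(line)
for key in ("external-address", "internal-address", "use-external-address"):
    print(f"{key}: {fields[key]!r}")
```

For this line the external address comes out as an empty string while use-external-address is true and the internal address is 10.3.0.5.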

Environment description

  • Ubuntu Server 22.04 that hosts the GitLab Runner with the docker-autoscaler executor
  • Ubuntu Server 22.04 with Docker as an image for the Virtual Machine Scale Set
  • GitLab Runner 17.1.0
  • Azure Fleeting Plugin 0.3.0

config.toml contents

concurrent = 1

[[runners]]
  name = "ubuntuGitlab"
  url = "https://gitlab.com"
  id = 38522903
  token = "<>"
  token_obtained_at = 2024-06-18T10:38:32Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker-autoscaler"
  [runners.autoscaler]
    plugin = "azure:latest"
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 1
    [runners.autoscaler.plugin_config]
      name = "ubuntuScaleSet1"
      subscription_id = "<>"
      resource_group_name = "<>"
    [runners.autoscaler.connector_config]
      username = "<>"
      password = "<>"
      use_static_credentials = true
      timeout = "10m"
      use_external_addr = true
    [[runners.autoscaler.policy]]
      idle_count = 5
      idle_time = "20m0s"
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    network_mtu = 0

Used GitLab Runner version

Version:      17.1.0
Git revision: fe451d5a
Git branch:   17-1-stable
GO version:   go1.22.3
Built:        2024-06-20T15:06:38+0000
OS/Arch:      linux/amd64