Stuck at "Dialing instance ..." using GitLab Runner with docker-autoscaler in Azure
Summary
After trying a ton of things just to get my runner to show up, authenticate, and create VMs in my scale set in Azure, I finally got a connection working and the job is picked up.
However, I'm now stuck at "Dialing instance x ...." forever. I first tried it with my custom Windows Server image, but to rule out Windows as the problem I'm now using an Ubuntu image in the scale set.
I don't know how to investigate this further, what additional information I could provide, or what else I could try. I'd appreciate any help.
Steps to reproduce
- Create a VM in Azure based on Ubuntu Server 22.04. Log into it, install Docker, log out, deprovision it, and capture it into an image.
- Create a Virtual Machine Scale Set in Azure using the image and use uniform orchestration and manual scaling.
- Also add a cloud-init script to the scale set that during provisioning adds the user to the docker group.
- Create a second VM in Azure based on Ubuntu Server 22.04. Log in as root, install gitlab-runner and the Azure fleeting plugin, and register the runner.
- Edit config.toml as shown under "config.toml contents" below.
- Start the runner.
- Start a job that will use the runner.
Actual behavior
Job is stuck at "Dialing instance x ...."
Expected behavior
Either the job goes through or I get a message telling me why it can't go through.
Relevant logs and/or screenshots
Job Log
Running with gitlab-runner 17.1.0 (fe451d5a)
on ubuntuGitlab qxwMYYfQR, system ID: s_699d82a7c6ed
Resolving secrets
Preparing the "docker-autoscaler" executor
Dialing instance 8...
After running into the timeout it tries again:
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 0 times: dial tcp 10.3.0.4:22: i/o timeout
Will be retried in 3s ...
Dialing instance 8...
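The error above is a plain TCP timeout on port 22, so one way to narrow it down is to test from the runner host whether that port is reachable at all, independently of GitLab Runner. The following is a minimal sketch using only the Python standard library; the function name `can_dial` is hypothetical, and it only checks that a TCP connection opens, not that SSH authentication would succeed.

```python
import socket


def can_dial(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it does not speak SSH, it only
    checks that the port accepts a connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Calling `can_dial("10.3.0.4", 22)` on the runner VM while the scale set instance is up would show whether the timeout is reproducible outside the runner. If it returns False, the network path (for example subnet, network security group, or the SSH service on the instance) would be the next thing to look at.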
GitLab Runner Debug Log
Using gitlab-runner --debug run I get this:
Checking for jobs... received job=7138086521 repo_url=<> runner=<>
Processing chain chain-leaf=[0xc000119b80 0xc00011a680 0xc000be3700] context=certificate-chain-build resolve-full-chain=false
Added job to processing list builds=1 job=7138086521 max_builds=1 project=<> repo_url=<> time_in_queue_seconds=12
Failed to requeue the runner builds=1 max_builds=1 runner=<>
Running with gitlab-runner 17.1.0 (fe451d5a) job=7138086521 project=<> runner=<>
on ubuntuGitlab <>, system ID: job=7138086521 project=<> runner=<>
Resolving secrets job=7138086521 project=<> runner=<>
Preparing the "docker-autoscaler" executor job=7138086521 project=<> runner=<>
Preparing instance... job=7138086521 project=<> runner=<>
Dialing instance external-address= instance-id=6 internal-address=10.3.0.5 job=7138086521 project=<> runner=<> use-external-address=true
Dialing instance 6... job=7138086521 project=<> runner=<>
Feeding runners to channel builds=1 max_builds=1
Feeding runner to channel builds=1 max_builds=1 runner=<>
increasing instances response group=azure/<>/ubuntuScaleSetTest num_requested=1 num_successful=1 runner=<> subsystem=taskscaler
increase update group=azure/<>/ubuntuScaleSetTest pending=1 requesting=0 runner=<> subsystem=taskscaler total_pending=1
running removal() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile() group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
instance discovery cause=requested group=azure/<>/ubuntuScaleSetTest id=7 runner=<> state=creating subsystem=taskscaler
running provision() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Appending trace to coordinator...ok code=202 job=7138086521 job-log=0-298 job-status=running runner=<> sent-log=0-297 status=202 Accepted update-interval=1m0s
running removal() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile() group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running removal() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile() group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
......
......
instance update group=azure/<>/ubuntuScaleSetTest id=7 runner=<> state=running subsystem=taskscaler
running provision() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
ready instance=7 runner=<> subsystem=taskscaler took=83.935382ms
Updating job... bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
running removal() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile() group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Updating job... bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
running removal() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
running reconcile() group=azure/<>/ubuntuScaleSetTest init=false runner=<> subsystem=taskscaler
running provision() group=azure/<>/ubuntuScaleSetTest runner=<> subsystem=taskscaler
Updating job... bytesize=298 checksum=crc32:a8ed549f job=7138086521 runner=<>
Submitting job to coordinator...ok bytesize=298 checksum=crc32:a8ed549f code=200 job=7138086521 job-status=running runner=<> update-interval=0s
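In the debug output, the "Dialing instance" line carries its details as key=value pairs, and it shows `use-external-address=true` together with an empty `external-address=`. Whether that is related to the timeout is not confirmed here. As a rough sketch, the fields can be pulled out of such a line like this; `parse_fields` is a hypothetical helper, and it assumes values contain no spaces, which holds for the line above but not for every log line.

```python
def parse_fields(line):
    """Collect key=value tokens from a debug log line; plain words are skipped.

    Hypothetical helper: assumes values contain no spaces or quotes.
    """
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


# The "Dialing instance" line copied from the debug log above.
line = (
    "Dialing instance external-address= instance-id=6 "
    "internal-address=10.3.0.5 job=7138086521 project=<> runner=<> "
    "use-external-address=true"
)

fields = parse_fields(line)
wants_external = fields.get("use-external-address") == "true"
has_external = bool(fields.get("external-address"))

if wants_external and not has_external:
    print(
        "external address requested but empty; internal address is",
        fields.get("internal-address"),
    )
```

Running the same check over a longer debug log would make it easy to see whether every instance is reported without an external address.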
Environment description
- Ubuntu Server 22.04 that hosts the GitLab Runner with the docker-autoscaler executor
- Ubuntu Server 22.04 with Docker as an image for the Virtual Machine Scale Set
- GitLab Runner 17.1.0
- Azure Fleeting Plugin 0.3.0
config.toml contents
concurrent = 1
[[runners]]
name = "ubuntuGitlab"
url = "https://gitlab.com"
id = 38522903
token = "<>"
token_obtained_at = 2024-06-18T10:38:32Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "docker-autoscaler"
[runners.autoscaler]
plugin = "azure:latest"
capacity_per_instance = 1
max_use_count = 1
max_instances = 1
[runners.autoscaler.plugin_config]
name = "ubuntuScaleSet1"
subscription_id = "<>"
resource_group_name = "<>"
[runners.autoscaler.connector_config]
username = "<>"
password = "<>"
use_static_credentials = true
timeout = "10m"
use_external_addr = true
[[runners.autoscaler.policy]]
idle_count = 5
idle_time = "20m0s"
[runners.docker]
tls_verify = false
image = "alpine:latest"
privileged = false
disable_entrypoint_overwrite = false
oom_kill_disable = false
disable_cache = false
volumes = ["/cache"]
shm_size = 0
network_mtu = 0
Used GitLab Runner version
Version: 17.1.0
Git revision: fe451d5a
Git branch: 17-1-stable
GO version: go1.22.3
Built: 2024-06-20T15:06:38+0000
OS/Arch: linux/amd64