Issue with SSH connections from fleeting manager to runner instance on self-hosted GCP runner
Summary
After a recent gitlab-runner update, the Docker Autoscaler/fleeting manager VM started hitting SSH timeouts when connecting to the runner VMs that were supposed to run jobs. Here is a log sample showing the issue:
Preparing the "docker-autoscaler" executor
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 0 times during 1m0s timeout: dial tcp 10.128.0.53:0: i/o timeout
Will be retried in 3s ...
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...fd056d94a...
...
This issue made the CI/CD runner completely inoperable: no scheduled job could run.
The workaround I found was to use an older version of the runner. I have isolated the regression to a commit between version 17.11.0~pre.127.g92594782 (works) and version 17.11.0 (fails).
Steps to reproduce
- Install a new runner using this Docker image or deploy v1.3.0 of this Terraform module
- Observe that jobs fail with the logs seen above
- Redeploy using the earlier Docker image or deploy v1.3.1 of the same Terraform module
- Jobs pass
I do not believe that anything in a .gitlab-ci.yml file had an impact on the bug.
Actual behavior
The runner manager using the fleeting plugin on GCP is unable to connect to the spawned runner VMs over SSH. While investigating, I was able to connect to the runner VM over SSH manually from the manager VM. To me, this indicates that the problem is not in the network or SSH setup, but within the runner itself.
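The manual check described above can also be done at the TCP level. This is a minimal sketch, assuming Python 3 is available on the manager VM; `can_connect` is a hypothetical helper written for illustration and is not part of gitlab-runner or the fleeting plugin.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the manager VM, `can_connect("10.128.0.53", 22)` probes the usual SSH port on the runner's internal address from the job log, while `can_connect("10.128.0.53", 0)` probes the port that appears in the error message. I would expect the first to succeed and the second to fail, but I have not captured that output here.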
Expected behavior
The manager VM is able to connect to its spawned runners.
Relevant logs and/or screenshots
job log
Preparing the "docker-autoscaler" executor
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 0 times during 1m0s timeout: dial tcp 10.128.0.53:0: i/o timeout
Will be retried in 3s ...
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...fd056d94a...
...
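One detail in the job log may be a useful clue: the address being dialed is `10.128.0.53:0`, that is, port 0 rather than the usual SSH port 22. A minimal sketch for pulling the dial target out of such an error line follows. `DIAL_RE` and `dial_target` are hypothetical names for illustration, and the pattern only handles IPv4 addresses and hostnames.

```python
import re
from typing import Optional, Tuple

# Matches the "dial tcp <host>:<port>" fragment of the runner's error message.
DIAL_RE = re.compile(r"dial tcp (?P<host>[^\s:]+):(?P<port>\d+)")


def dial_target(line: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) from a 'dial tcp' error line, or None if absent."""
    match = DIAL_RE.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


# Error line copied from the job log in this report.
line = (
    "ERROR: Preparation failed: preparing environment: dial ssh: "
    "after retrying 0 times during 1m0s timeout: "
    "dial tcp 10.128.0.53:0: i/o timeout"
)
target = dial_target(line)
if target is not None and target[1] == 0:
    print(f"runner dialed {target[0]} on port 0, not a usable SSH port")
```

This only extracts what the log already says. It does not explain why the runner ends up with port 0.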
Environment description
- Self-hosted
- Docker Autoscaler executor
- Deployed on GCP
- v17.11.0
config.toml contents
[[runners]]
name = "${RUNNER_NAME}"
url = "https://gitlab.com"
token = "${TOKEN}"
# uncomment for Windows Images when the Runner manager is hosted on Linux
environment = ["FF_USE_FASTZIP=1", "DOCKER_TLS_CERTDIR=", "DOCKER_HOST=tcp://docker:2375"]
executor = "docker-autoscaler"
limit = ${LIMIT}
# Docker Executor config
[runners.docker]
image = "${DEFAULT_IMAGE}"
privileged = true
tls_verify = false
# Autoscaler config
[runners.autoscaler]
plugin = "googlecloud:latest"
capacity_per_instance = 1
max_use_count = 1
max_instances = ${LIMIT}
[runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
name = "${INSTANCE_GROUP_NAME}" # GCP Instance Group name
project = "${PROJECT_ID}"
zone = "${ZONE}"
[runners.autoscaler.connector_config]
username = "core"
use_external_addr = false
use_static_credentials = true
key_path = "/root/.ssh/id_rsa"
timeout = "1m"
[[runners.autoscaler.policy]]
idle_count = 0
idle_time = "5m0s"
[runners.cache]
Type = "gcs"
path = "${CACHE_PATH}"
Shared = true
[runners.cache.gcs]
BucketName = "${BUCKET_NAME}"
Used GitLab Runner version
Version:      17.11.0
Git revision: 0f67ff19
Git branch:   17-11-stable
GO version:   go1.23.6 X:cacheprog
Built:        2025-04-14T10:18:18Z
OS/Arch:      linux/amd64
Possible fixes
I'm not sure what is causing the issue. I only know that it was introduced between the two commits referenced above. Again, the only workaround I know of is to use the prerelease version of the runner.