Issue with SSH connections from fleeting manager to runner instance on self-hosted GCP runner

Summary

After updating gitlab-runner to 17.11.0, the Docker Autoscaler (fleeting) manager VM began hitting SSH timeouts when connecting to the runner VMs that were supposed to run jobs. Here is a sample of the job log:

Preparing the "docker-autoscaler" executor
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...
ERROR: Failed to remove network for build
ERROR: Preparation failed: preparing environment: dial ssh: after retrying 0 times during 1m0s timeout: dial tcp 10.128.0.53:0: i/o timeout
Will be retried in 3s ...
Dialing instance https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance_id>...
fd056d94a...
...

This issue left the CI/CD runner completely unable to run any scheduled job.

The workaround I found is to run an older build of the runner. I have narrowed the regression to the commits between version 17.11.0~pre.127.g92594782 (which works) and version 17.11.0 (which fails).

Steps to reproduce

I do not believe the contents of .gitlab-ci.yml affect the bug: every job scheduled on this runner fails during executor preparation, before any job script runs.

Actual behavior

The runner manager, using the fleeting plugin on GCP, is unable to connect to the runner VMs it spawns over SSH. While investigating, I was able to connect manually over SSH from the manager VM to the runner VM. To me, this indicates that the problem is not in the network or in the SSH setup, but in the runner itself.
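One detail in the job log may be worth checking: the failed dial targets 10.128.0.53:0, that is, TCP port 0, while a manual SSH session would normally use port 22. Whether this is the actual cause is an assumption on my part, not something confirmed here. The following is a minimal Python sketch that could be run on the manager VM to compare the two ports. The helper names `dial_target` and `tcp_reachable` are hypothetical, and port 22 is assumed to be where the runner VMs accept SSH.

```python
import re
import socket

# Matches the "dial tcp <ipv4>:<port>" fragment of the runner's error message.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dial_target(log_line):
    """Return (host, port) parsed from a 'dial tcp host:port' fragment, or None."""
    match = DIAL_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False


if __name__ == "__main__":
    # Error line copied from the job log in this report.
    line = (
        "ERROR: Preparation failed: preparing environment: dial ssh: "
        "after retrying 0 times during 1m0s timeout: "
        "dial tcp 10.128.0.53:0: i/o timeout"
    )
    host, port = dial_target(line)
    print(f"runner dialed {host}:{port}")
    # Compare the port from the log with the assumed SSH port 22.
    for candidate in sorted({port, 22}):
        print(f"{host}:{candidate} reachable: {tcp_reachable(host, candidate)}")
```

If port 22 is reachable from the manager VM while the port taken from the log is not, that would be consistent with the manual SSH test succeeding and the runner's own dial failing. It would not by itself show where the port value comes from.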

Expected behavior

The manager VM is able to connect to its spawned runners.

Relevant logs and/or screenshots

job log
(Identical to the log sample in the Summary above.)

Environment description

  • Self-hosted
  • Docker Autoscaler executor
  • Deployed on GCP
  • GitLab Runner v17.11.0

config.toml contents

[[runners]]
  name  = "${RUNNER_NAME}"
  url   = "https://gitlab.com"
  token = "${TOKEN}"

  # uncomment for Windows Images when the Runner manager is hosted on Linux
  environment = ["FF_USE_FASTZIP=1", "DOCKER_TLS_CERTDIR=", "DOCKER_HOST=tcp://docker:2375"]

  executor = "docker-autoscaler"
  limit    = ${LIMIT}

  # Docker Executor config
  [runners.docker]
    image         = "${DEFAULT_IMAGE}"
    privileged    = true
    tls_verify    = false

  # Autoscaler config
  [runners.autoscaler]
    plugin = "googlecloud:latest"

    capacity_per_instance = 1
    max_use_count         = 1
    max_instances         = ${LIMIT}

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name             = "${INSTANCE_GROUP_NAME}" # GCP Instance Group name
      project          = "${PROJECT_ID}"
      zone             = "${ZONE}"

    [runners.autoscaler.connector_config]
      username               = "core"
      use_external_addr      = false
      use_static_credentials = true
      key_path               = "/root/.ssh/id_rsa"
      timeout                = "1m"

    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time  = "5m0s"

  [runners.cache]
    Type = "gcs"
    path = "${CACHE_PATH}"
    Shared = true
    [runners.cache.gcs]
      BucketName = "${BUCKET_NAME}"

Used GitLab Runner version

Version:      17.11.0
Git revision: 0f67ff19
Git branch:   17-11-stable
GO version:   go1.23.6 X:cacheprog
Built:        2025-04-14T10:18:18Z
OS/Arch:      linux/amd64

Possible fixes

I'm not sure what is causing the issue; I only know that it was introduced somewhere between the two commits linked above. Again, the only workaround I know of is to run a prerelease version of the runner.