Transport errors connecting runner <> plugin

Since release 1.1.0 of this plugin, our Runner Manager frequently fails to accept jobs and to autoscale.

The job shows:

Preparing the "docker-autoscaler" executor
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Job failed (system failure): unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded

The logs of the runner-manager show the following:

decreasing instances                                amount=1 group=gce/our-gcp-project/europe-west4/gitlab-runner-mig-europe-west4 runner=aa1i234j9eaf subsystem=taskscaler
ERROR: decreasing instances                         err=rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3873557801: connect: connection refused" group=gce/our-gcp-project/europe-west4/gitlab-runner-mig-europe-west4 runner=aa1i234j9eaf subsystem=taskscaler
ERROR: reconcile                                    err=rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3873557801: connect: connection refused" runner=aa1i234j9eaf subsystem=taskscaler

The only way to recover from this is to restart the runner. It then works for a few hours to a few days until the error recurs.

Environment

  • GitLab Runner version: 18.3.0
  • Fleeting plugin version: v1.1.0
  • Operating System: Linux

Context

Edited by Frank Klaassen