Transport errors connecting runner <> plugin
Since release 1.1.0, our Runner Manager frequently fails to accept jobs and to autoscale using this plugin.
The job shows:
Preparing the "docker-autoscaler" executor
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Job failed (system failure): unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
The logs of the runner-manager show the following:
decreasing instances amount=1 group=gce/our-gcp-project/europe-west4/gitlab-runner-mig-europe-west4 runner=aa1i234j9eaf subsystem=taskscaler
ERROR: decreasing instances err=rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3873557801: connect: connection refused" group=gce/our-gcp-project/europe-west4/gitlab-runner-mig-europe-west4 runner=aa1i234j9eaf subsystem=taskscaler
ERROR: reconcile err=rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3873557801: connect: connection refused" runner=aa1i234j9eaf subsystem=taskscaler
The only way to recover from this is to restart the runner. It then works for a few hours/days until it happens again.
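Until the root cause is found, the failure can at least be detected from the runner-manager log so that the restart does not depend on someone noticing stuck jobs. The following is a minimal sketch in Python, assuming the log lines keep the shape shown above. The names `PLUGIN_DIAL_ERROR` and `plugin_socket_refused` are hypothetical and are not part of GitLab Runner or the fleeting plugin.

```python
import re
from typing import Optional

# Pattern derived from the runner-manager log lines quoted in this report:
# a gRPC "Unavailable" error caused by a refused dial on the plugin's Unix socket.
# Assumption: the message format stays the same across runner versions.
PLUGIN_DIAL_ERROR = re.compile(
    r"code = Unavailable .*dial unix (?P<socket>\S+): connect: connection refused"
)


def plugin_socket_refused(line: str) -> Optional[str]:
    """Return the plugin socket path if the line reports a refused dial, else None."""
    match = PLUGIN_DIAL_ERROR.search(line)
    return match.group("socket") if match else None
```

A watchdog could feed each new log line to `plugin_socket_refused` and trigger the restart on the first match. How the log is read and how the runner is restarted depends on the deployment and is left out here.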
Environment
- GitLab Runner version: 18.3.0
- Fleeting plugin version: v1.1.0
- Operating System: Linux
Context
- Possibly related: #22 (closed)
- A similar error was reported in gitlab-org/fleeting/fleeting#34 (comment 2094032126)
Edited by Frank Klaassen