docker-autoscaler fails jobs because it can't remove networks on terminated instances when FF_NETWORK_PER_BUILD is enabled
Summary
Jobs fail with "Failed to remove network for build" errors if:
- FF_NETWORK_PER_BUILD is enabled on jobs
- The runner uses Docker Autoscaler with the fleeting plugin
- The instance the job was assigned to is no longer accessible
This is not mitigated by setting FF_USE_FLEETING_ACQUIRE_HEARTBEATS to true.
Steps to reproduce
- Create an AWS autoscaling group to manage runner worker EC2 instances for the runner manager
- Configure a runner manager to:
  - Use the "docker-autoscaler" backend
  - Use the AWS fleeting plugin with the autoscaling group configured in step 1
  - Have the FF_USE_FLEETING_ACQUIRE_HEARTBEATS feature flag set
- Assign this runner to a project that has a pipeline containing jobs that use the FF_NETWORK_PER_BUILD feature flag
- Run pipelines so the fleeting plugin creates instances in the autoscaling group
- Manually terminate all instances in the autoscaling group
- Attempt to run a job that uses the FF_NETWORK_PER_BUILD feature flag
.gitlab-ci.yml
test:
  stage: test
  image:
    name: ubuntu:24.04
  services:
    - name: selenium/standalone-chrome:4.8
      alias: selenium-chrome
  variables:
    FF_NETWORK_PER_BUILD: 1
  script:
    - echo "Tests go here"
  retry:
    max: 2
    when: runner_system_failure
Actual behavior
Jobs fail with the error "Failed to remove network for build" and are repeatedly retried on the same inaccessible worker instance, quickly burning through the retries specified on the job.
Expected behavior
The instance is marked as dead, and either another instance is selected or the autoscaling group is instructed to start a new one.
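To make the expected behavior concrete, here is a minimal Go sketch of it. It is not taken from the GitLab Runner or fleeting source: `pool`, `instance`, `prepare`, and `errInstanceGone` are hypothetical names, and the sentinel error only stands in for the plugin's "instance no longer running" RPC error. The idea is that an instance reporting that error is marked dead and skipped, instead of being retried.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceGone is a hypothetical sentinel standing in for the plugin's
// "instance no longer running" RPC error.
var errInstanceGone = errors.New("instance no longer running")

// instance is a hypothetical view of one worker tracked by the runner manager.
type instance struct {
	id   string
	dead bool
}

// pool is a hypothetical, simplified stand-in for the autoscaler's instance list.
type pool struct {
	instances []*instance
}

// acquire returns the first instance not marked dead, or nil if none is left.
func (p *pool) acquire() *instance {
	for _, inst := range p.instances {
		if !inst.dead {
			return inst
		}
	}
	return nil
}

// prepare tries each live instance in turn. An instance that reports
// errInstanceGone is marked dead so that later attempts skip it.
func prepare(p *pool, connect func(*instance) error) (*instance, error) {
	for {
		inst := p.acquire()
		if inst == nil {
			// A real manager would ask the autoscaling group to scale up here.
			return nil, errors.New("no live instance available; scale-up required")
		}
		err := connect(inst)
		if err == nil {
			return inst, nil
		}
		if errors.Is(err, errInstanceGone) {
			inst.dead = true
			continue
		}
		return nil, err
	}
}

func main() {
	p := &pool{instances: []*instance{{id: "i-terminated"}, {id: "i-healthy"}}}
	connect := func(inst *instance) error {
		if inst.id == "i-terminated" {
			return fmt.Errorf("refreshing connect info: %w", errInstanceGone)
		}
		return nil
	}
	inst, err := prepare(p, connect)
	if err != nil {
		fmt.Println("preparation failed:", err)
		return
	}
	fmt.Println("job assigned to", inst.id)
}
```

With this shape, a terminated worker costs one failed connection attempt rather than the job's whole retry budget.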
Relevant logs and/or screenshots
job log
Running with gitlab-runner 18.3.1 (5a021a1c)
on [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "docker-autoscaler" executor
00:09
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Job failed (system failure): creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Environment description
This is a runner created using this Terraform module: https://github.com/cattle-ops/terraform-aws-gitlab-runner.
The workers are using the current GRIT AMI.
config.toml contents
concurrent = 20
check_interval = 3
sentry_dsn = ""
log_format = "json"
listen_address = ""
connection_max_age = "15m"
[[runners]]
name = "[REDACTED]"
url = "https://gitlab.com"
clone_url = ""
token = "[REDACTED]"
executor = "docker-autoscaler"
environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=true,FF_USE_FLEETING_ACQUIRE_HEARTBEATS=true"]
pre_build_script = ""
post_build_script = ""
# GitLab Runner < 17, otherwise use pre_get_sources_script
pre_clone_script = ""
pre_get_sources_script = ""
request_concurrency = 5
output_limit = 4096
limit = 20
[runners.docker]
disable_cache = false
image = "docker:18.03.1-ce"
privileged = true
pull_policy = ["always"]
shm_size = 0
tls_verify = false
volumes = ["/cache"]
[runners.docker.tmpfs]
[runners.docker.services_tmpfs]
[runners.cache]
Type = "s3"
Shared = false
[runners.cache.s3]
AuthenticationType = "iam"
ServerAddress = "s3.amazonaws.com"
BucketName = "[REDACTED]"
BucketLocation = "[REDACTED]"
Insecure = false
# Autoscaler config
[runners.autoscaler]
plugin = "aws:latest"
capacity_per_instance = 1
update_interval = "1m"
update_interval_when_expecting = "2s"
max_use_count = 100
max_instances = 20
instance_ready_command=""
[runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
name = "[REDACTED]" # AWS Autoscaling Group name
[runners.autoscaler.connector_config]
username = "ubuntu"
use_external_addr = false
Used GitLab Runner version
Version: 18.3.1
Git revision: 5a021a1c
Git branch: 18-3-stable
GO version: go1.24.4 X:cacheprog
Built: 2025-09-04T15:24:16Z
OS/Arch: linux/amd64
Possible fixes
My best guess is that the heartbeat check runs only after the manager has already attempted to remove the per-build network, so the removal fails before the instance is detected as gone.
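The ordering guess can be illustrated with a small Go sketch. It is only a model of the hypothesis, not the actual GitLab Runner code path: `step`, `runSteps`, `suspectedOrder`, and `proposedOrder` are hypothetical names, and both steps are assumed to fail in the same way when the instance is gone.

```go
package main

import (
	"errors"
	"fmt"
)

// errGone stands in for the "instance no longer running" error (hypothetical).
var errGone = errors.New("instance no longer running")

// step is one hypothetical stage of job preparation against a worker instance.
type step struct {
	name string
	run  func(alive bool) error
}

// needsInstance models any operation that requires a reachable instance.
func needsInstance(alive bool) error {
	if !alive {
		return errGone
	}
	return nil
}

// suspectedOrder models the guess: network removal runs before the heartbeat.
var suspectedOrder = []step{
	{name: "remove network", run: needsInstance},
	{name: "heartbeat", run: needsInstance},
}

// proposedOrder checks liveness first, so a dead instance is detected
// before any cleanup is attempted on it.
var proposedOrder = []step{
	{name: "heartbeat", run: needsInstance},
	{name: "remove network", run: needsInstance},
}

// runSteps executes steps in order, stopping at the first failure. It returns
// the names of the steps that were attempted and the wrapped error, if any.
func runSteps(alive bool, steps []step) ([]string, error) {
	var ran []string
	for _, s := range steps {
		ran = append(ran, s.name)
		if err := s.run(alive); err != nil {
			return ran, fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return ran, nil
}

func main() {
	for _, order := range [][]step{suspectedOrder, proposedOrder} {
		ran, err := runSteps(false, order)
		fmt.Println("attempted:", ran, "error:", err)
	}
}
```

In the suspected order the heartbeat never gets a chance to run against a terminated instance. In the proposed order the heartbeat is the step that fails, which would give the manager a clear signal to discard the instance.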