docker-autoscaler fails jobs because it can't remove networks on terminated instances when FF_NETWORK_PER_BUILD is enabled

Summary

Jobs fail with "Failed to remove network for build" errors if:

  1. FF_NETWORK_PER_BUILD is enabled on jobs
  2. The runner uses Docker Autoscaler with the fleeting plugin
  3. The instance the job was assigned to is no longer accessible (for example, it has been terminated)

This is not mitigated by setting FF_USE_FLEETING_ACQUIRE_HEARTBEATS to true.

Steps to reproduce

  1. Create an AWS autoscaling group to manage runner worker EC2 instances for the runner manager
  2. Configure a runner manager to:
    • Use the "docker-autoscaler" backend
    • Use the AWS fleeting plugin with the autoscaling group configured in step 1
    • Have the FF_USE_FLEETING_ACQUIRE_HEARTBEATS feature flag set to true
  3. Assign this runner to a project that has a pipeline containing jobs that use the FF_NETWORK_PER_BUILD feature flag
  4. Run pipelines so the fleeting plugin creates instances in the autoscaling group
  5. Manually terminate all instances in the autoscaling group
  6. Attempt to run a job that uses the FF_NETWORK_PER_BUILD feature flag

.gitlab-ci.yml
test:
  stage: test

  image:
    name: ubuntu:24.04

  services:
    - name: selenium/standalone-chrome:4.8
      alias: selenium-chrome

  variables:
    FF_NETWORK_PER_BUILD: 1

  script:
    - echo "Tests go here"

  retry:
    max: 2
    when: runner_system_failure

Actual behavior

Jobs fail with the error "Failed to remove network for build" and are repeatedly retried on the same inaccessible worker instance, quickly burning through the retries specified on the job.

Expected behavior

The instance is marked as dead, and either a different instance is selected or the autoscaling group is instructed to start a new one.
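
To make the expected behavior concrete, here is a rough Go sketch of the selection logic I have in mind. It is not taken from the GitLab Runner or fleeting source: the Instance and Pool types, the Connect hook, and the ScaleUpRequests counter are all hypothetical names invented for illustration, and the error value only mimics the "instance no longer running" message from the job log.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical stand-in for a worker tracked by the runner manager.
type Instance struct {
	ID   string
	Dead bool
}

// errInstanceGone mimics the "instance no longer running" failure from the job log.
var errInstanceGone = errors.New("instance no longer running")

// Pool is a hypothetical in-memory view of the autoscaling group.
type Pool struct {
	Instances []*Instance
	// Connect is injected so the sketch runs without any cloud access.
	Connect func(id string) error
	// ScaleUpRequests counts how often a replacement instance was requested.
	ScaleUpRequests int
}

// Acquire returns the first instance that accepts a connection.
// Instances that report errInstanceGone are marked dead and skipped, so a
// later call never hands out the same terminated instance again.
func (p *Pool) Acquire() (*Instance, error) {
	for _, inst := range p.Instances {
		if inst.Dead {
			continue
		}
		err := p.Connect(inst.ID)
		if err == nil {
			return inst, nil
		}
		if errors.Is(err, errInstanceGone) {
			inst.Dead = true
			continue
		}
		// Any other error is returned unchanged; only a confirmed
		// termination marks the instance as dead.
		return nil, err
	}
	// No live instance is left, so record a request for a new one.
	p.ScaleUpRequests++
	return nil, errors.New("no live instance available, scale-up requested")
}

func main() {
	pool := &Pool{
		Instances: []*Instance{{ID: "worker-1"}, {ID: "worker-2"}},
		Connect: func(id string) error {
			if id == "worker-1" {
				return errInstanceGone
			}
			return nil
		},
	}
	inst, err := pool.Acquire()
	fmt.Println(inst.ID, err)
}
```

In this sketch a retry would land on worker-2 instead of consuming another attempt on the terminated worker-1, which is the difference from the behavior described above.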

Relevant logs and/or screenshots

job log
Running with gitlab-runner 18.3.1 (5a021a1c)
  on [REDACTED], system ID: [REDACTED]
Resolving secrets
Preparing the "docker-autoscaler" executor
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running
Will be retried in 3s ...
ERROR: Job failed (system failure): creating docker connection: creating docker tunnel: preparing environment: getting instance connect info: refreshing connect info: rpc error: code = Unknown desc = instance no longer running

Environment description

This is a runner created using this Terraform module: https://github.com/cattle-ops/terraform-aws-gitlab-runner.

The workers are using the current GRIT AMI.

config.toml contents
concurrent = 20
check_interval = 3
sentry_dsn = ""
log_format = "json"
listen_address = ""
connection_max_age = "15m"

[[runners]]
  name = "[REDACTED]"
  url = "https://gitlab.com"

  clone_url = ""
  token = "[REDACTED]"
  executor = "docker-autoscaler"
  environment = ["FF_GITLAB_REGISTRY_HELPER_IMAGE=true,FF_USE_FLEETING_ACQUIRE_HEARTBEATS=true"]
  pre_build_script = ""
  post_build_script = ""
  # GitLab Runner < 17, otherwise use pre_get_sources_script
  pre_clone_script = ""
  pre_get_sources_script = ""
  request_concurrency = 5
  output_limit = 4096
  limit = 20

    [runners.docker]
    disable_cache = false
    image = "docker:18.03.1-ce"
    privileged = true
    pull_policy = ["always"]
    shm_size = 0
    tls_verify = false
    volumes = ["/cache"]

  [runners.docker.tmpfs]

  [runners.docker.services_tmpfs]

  [runners.cache]
    Type = "s3"
    Shared = false
    [runners.cache.s3]
      AuthenticationType = "iam"
      ServerAddress = "s3.amazonaws.com"
      BucketName = "[REDACTED]"
      BucketLocation = "[REDACTED]"
      Insecure = false

  # Autoscaler config
  [runners.autoscaler]
    plugin = "aws:latest"

    capacity_per_instance = 1
    update_interval = "1m"
    update_interval_when_expecting = "2s"

    max_use_count = 100
    max_instances = 20

    instance_ready_command=""

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name = "[REDACTED]"     # AWS Autoscaling Group name

    [runners.autoscaler.connector_config]
      username          = "ubuntu"
      use_external_addr = false

Used GitLab Runner version

Version:      18.3.1
Git revision: 5a021a1c
Git branch:   18-3-stable
GO version:   go1.24.4 X:cacheprog
Built:        2025-09-04T15:24:16Z
OS/Arch:      linux/amd64

Possible fixes

My best guess is that the heartbeat check runs after the manager attempts to remove the per-build network, which would mean the terminated instance is not detected before the cleanup fails.
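
If that guess is right, one possible direction is to check liveness before attempting the network cleanup. The Go sketch below only illustrates that ordering. It does not reflect how GitLab Runner is actually structured: cleanupBuild and the heartbeat and removeNetwork callbacks are hypothetical names, and the error value again just mimics the message from the job log.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceGone mimics the "instance no longer running" failure from the job log.
var errInstanceGone = errors.New("instance no longer running")

// cleanupBuild checks instance liveness before touching the per-build network.
// It returns the ordered list of steps that ran, plus any error.
// Both callbacks are hypothetical hooks standing in for the real operations.
func cleanupBuild(heartbeat, removeNetwork func() error) ([]string, error) {
	steps := []string{"heartbeat"}
	if err := heartbeat(); err != nil {
		// The instance is unreachable, so skip the network removal and
		// surface the liveness error so the caller can replace the instance.
		return steps, fmt.Errorf("instance unavailable, skipping network removal: %w", err)
	}
	steps = append(steps, "remove-network")
	if err := removeNetwork(); err != nil {
		return steps, fmt.Errorf("removing network: %w", err)
	}
	return steps, nil
}

func main() {
	// Simulate a terminated instance: the heartbeat fails, so the network
	// removal callback is never invoked.
	steps, err := cleanupBuild(
		func() error { return errInstanceGone },
		func() error { return nil },
	)
	fmt.Println(steps, err)

	if errors.Is(err, errInstanceGone) {
		fmt.Println("instance should be marked dead and replaced")
	}
}
```

With this ordering the failure would be reported as a dead instance rather than as "Failed to remove network for build", which might give the autoscaler a chance to pick a different instance before the job's retries are spent.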