Runner fails to clean up containers after job timeout due to shared context expiry (Docker Autoscaler + AWS Fleeting)

Summary

When GitLab CI/CD jobs time out while using the Docker Autoscaler executor with the AWS fleeting plugin, the associated Docker containers are not cleaned up. The orphaned containers continue to consume resources, degrading the EC2 instance and causing subsequent jobs assigned to it to fail. The root cause appears to be that the SSH context used for cleanup is cancelled prematurely when the job times out, preventing container and volume removal.
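As an illustration of the suspected failure mode (a simplified sketch, not GitLab Runner source; names are illustrative), a cleanup call that inherits the job's already-expired context fails immediately with "context deadline exceeded", matching the volume-cleanup errors in the logs below:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// removeVolumes stands in for the Docker "remove volume" call that the runner
// makes through the SSH tunnel to the autoscaled instance.
func removeVolumes(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // the removal itself takes some time
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Stand-in for the job timeout (10m in the repro job, shortened here).
	jobCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	<-jobCtx.Done() // the job script runs until the timeout fires

	// Cleanup reuses the expired job context, so it fails immediately.
	fmt.Println("cleanup error:", removeVolumes(jobCtx)) // context deadline exceeded
}
```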

Steps to reproduce

This is difficult to reproduce consistently, but the simplified job below highlights the behavior:

```yaml
timeout-job:
  script:
    - sleep 3600
  timeout: 10m
```

  1. Configure a runner using Docker Autoscaler with AWS fleeting plugin.
  2. Run a job that exceeds its configured timeout, without setting RUNNER_SCRIPT_TIMEOUT.
  3. Observe the failure and log behavior on the EC2 instance.

Actual behavior

  • The job times out at 10 minutes.
  • The Docker container continues to run after the timeout.
  • The runner assigns a new job to the same instance.
  • The new job fails because the instance is degraded (memory exhaustion or inability to SSH).
  • Logs show errors such as "context deadline exceeded" during cleanup attempts.
  • The instance eventually fails AWS health checks and is replaced.

Expected behavior

  • When a job times out, its associated Docker container and volumes should be cleaned up properly.
  • The instance should not be reused for subsequent jobs if cleanup fails or the instance is in a degraded state.
  • Cleanup logic should not be impacted by the job timeout context.

Relevant logs and/or screenshots

```
WARNING: step_script could not run to completion because the timeout was exceeded.
ERROR: Failed to cleanup volumes
ERROR: Job failed: execution took longer than 1h0m0s seconds
```

```json
{"error":"remove temporary volumes: Delete \"http://internal.tunnel.invalid/v1.44/volumes/runner-...\": context deadline exceeded","job":78048383,"level":"error","msg":"Failed to cleanup volumes"}
{"job":78061752,"level":"warning","msg":"Preparation failed: preparing environment: dial ssh: after retrying 0 times during 10m0s timeout: ssh: handshake failed: read tcp ...: use of closed network connection"}
{"error":"networksManager is undefined","msg":"Failed to remove network for build"}
```

Environment description

  • Self-managed GitLab instance (17.8.5)
  • Runner version 17.8.3 (some upgraded to 17.9.1 during testing)
  • Using docker-autoscaler executor
  • AWS fleeting plugin
  • High job volume: ~280,000 jobs/month
  • ~88 EC2 instance failures in 30 days
  • Feature flag FF_USE_FLEETING_ACQUIRE_HEARTBEATS enabled during testing

```toml
[[runners]]
  name = "[REDACTED]"
  url = "[REDACTED]"
  executor = "docker-autoscaler"
  environment = ["FF_USE_FLEETING_ACQUIRE_HEARTBEATS=true"]
  token = "[REDACTED]"

  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      BucketLocation = "eu-west-1"
      BucketName = "[REDACTED]"

  [runners.docker]
    privileged = true
    image = "docker"
    volumes = ["/cache", "/etc/docker:/etc/docker:ro", "/etc/docker/certs:/certs"]
    shm_size = 4294967296

  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 50

    [runners.autoscaler.plugin_config]
      config_file = "/etc/gitlab-runner/aws.config"

    [runners.autoscaler.connector_config]
      use_static_credentials = true
      key_path = "/etc/gitlab-runner/worker_private_key.pem"

    [[runners.autoscaler.policy]]
      idle_count = 4
      idle_time = "20m0s"
```

Used GitLab Runner version

Running with GitLab Runner 17.8.3

(Some runner managers were upgraded to 17.9.1 to test the heartbeat feature.)

Possible fixes

  • Refactor the runner cleanup logic to decouple the cleanup SSH context from the job timeout context (see the sketch after this list).
  • Ensure that removeContainer uses its own timeout context so it can finish cleanup even after the job has timed out.
  • Investigate and improve the resilience of the Docker cleanup logic under docker-autoscaler.
  • Consider automatically detecting instances with leftover containers and skipping them for new jobs.
  • Consider making RUNNER_SCRIPT_TIMEOUT available as a global fallback setting, or provide a smarter default, so project-level overrides are not required.
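
A minimal sketch of the first two points, assuming the decoupling is done by deriving the cleanup context from context.Background() with its own bounded timeout. Function names (cleanupAfterJob, removeContainerStub) and the 5-minute value are illustrative, not GitLab Runner identifiers:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// cleanupTimeout is independent of the job timeout; the value is arbitrary.
const cleanupTimeout = 5 * time.Minute

// removeContainerStub stands in for the Docker API call that removes the
// build container; it only needs to respect the context it is given.
func removeContainerStub(ctx context.Context, id string) error {
	select {
	case <-time.After(100 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("remove %s: %w", id, ctx.Err())
	}
}

// cleanupAfterJob deliberately ignores the (possibly already expired) job
// context when picking a deadline and uses a fresh, bounded context instead.
func cleanupAfterJob(jobCtx context.Context, containerID string) error {
	_ = jobCtx // only relevant for logging/tracing, never for the deadline

	ctx, cancel := context.WithTimeout(context.Background(), cleanupTimeout)
	defer cancel()

	err := removeContainerStub(ctx, containerID)
	if errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("cleanup itself exceeded %s: %w", cleanupTimeout, err)
	}
	return err
}

func main() {
	// Simulate a job context that the timeout has already cancelled.
	jobCtx, cancel := context.WithCancel(context.Background())
	cancel()

	// Cleanup still succeeds because it does not inherit the job deadline.
	fmt.Println("cleanup error:", cleanupAfterJob(jobCtx, "runner-build-container"))
}
```

Keeping the cleanup deadline bounded, rather than unbounded, still protects against cleanup hanging indefinitely on an instance that is already dead.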