Runner fails to clean up containers after job timeout due to shared context expiry (Docker Autoscaler + AWS Fleeting)
Summary
When GitLab jobs time out while using the Docker Autoscaler executor and AWS fleeting plugin, the associated Docker containers are not being properly cleaned up. These orphaned containers continue to consume resources, leading to EC2 instance degradation and failed subsequent jobs. The root cause appears to be that the SSH context used for cleanup is cancelled prematurely when the job times out, preventing volume and container cleanup.
Steps to reproduce
It is difficult to consistently reproduce, but this simplified example highlights the behavior:
    timeout-job:
      script:
        - sleep 3600
      timeout: 10m
- Configure a runner using Docker Autoscaler with AWS fleeting plugin.
- Run a job that exceeds the defined timeout without `RUNNER_SCRIPT_TIMEOUT` set.
- Observe the failure and log behavior on the EC2 instance.
Actual behavior
- The job times out at 10 minutes.
- Docker container continues to run after timeout.
- Runner tries to assign a new job to the same instance.
- New job fails due to memory exhaustion or inability to SSH (instance degraded).
- Logs show errors like `context deadline exceeded` during cleanup attempts.
- Instance eventually fails AWS health checks and is replaced.
Expected behavior
- When a job times out, its associated Docker container and volumes should be cleaned up properly.
- The instance should not be reused for subsequent jobs if cleanup fails or the instance is in a degraded state.
- Cleanup logic should not be impacted by the job timeout context.
Relevant logs and/or screenshots
WARNING: step_script could not run to completion because the timeout was exceeded.
ERROR: Failed to cleanup volumes
ERROR: Job failed: execution took longer than 1h0m0s seconds
{"error":"remove temporary volumes: Delete \"http://internal.tunnel.invalid/v1.44/volumes/runner-...\": context deadline exceeded","job":78048383,"level":"error","msg":"Failed to cleanup volumes"}
{"job":78061752,"level":"warning","msg":"Preparation failed: preparing environment: dial ssh: after retrying 0 times during 10m0s timeout: ssh: handshake failed: read tcp ...: use of closed network connection"}
{"error":"networksManager is undefined","msg":"Failed to remove network for build"}
Environment description
- Self-managed GitLab instance (17.8.5)
- Runner version 17.8.3 (some upgraded to 17.9.1 during testing)
- Using `docker-autoscaler` executor
- AWS fleeting plugin
- High job volume: ~280,000 jobs/month
- ~88 EC2 instance failures in 30 days
- Feature flag `FF_USE_FLEETING_ACQUIRE_HEARTBEATS` enabled during testing
```toml
[[runners]]
  name = "[REDACTED]"
  url = "[REDACTED]"
  executor = "docker-autoscaler"
  environment = ["FF_USE_FLEETING_ACQUIRE_HEARTBEATS=true"]
  token = "[REDACTED]"

  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      BucketLocation = "eu-west-1"
      BucketName = "[REDACTED]"

  [runners.docker]
    privileged = true
    image = "docker"
    volumes = ["/cache", "/etc/docker:/etc/docker:ro", "/etc/docker/certs:/certs"]
    shm_size = 4294967296

  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 50

    [runners.autoscaler.plugin_config]
      config_file = "/etc/gitlab-runner/aws.config"

    [runners.autoscaler.connector_config]
      use_static_credentials = true
      key_path = "/etc/gitlab-runner/worker_private_key.pem"

    [[runners.autoscaler.policy]]
      idle_count = 4
      idle_time = "20m0s"
```
Used GitLab Runner version
Running with GitLab Runner 17.8.3
(Some runner managers upgraded to 17.9.1 for testing heartbeat feature.)
Possible fixes
- Refactor runner cleanup logic to decouple the cleanup SSH context from the job timeout context.
- Ensure that `removeContainer` uses its own timeout context so it can complete cleanup even after the job times out.
- Investigate and improve resilience of Docker cleanup logic under `docker-autoscaler`.
- Consider adding automatic detection and skipping of instances with leftover containers.
- Consider extending the `RUNNER_SCRIPT_TIMEOUT` behavior to a global fallback setting or smarter default without project-level overrides.