[v1.1.0] Runner fails to acquire instances causing job queue deadlock - regression from v1.0.0
Problem Description
After upgrading the fleeting plugin from v1.0.0 to v1.1.0, runners pick up jobs but then fail to acquire an instance, hitting timeout errors and deadlocking the job queue. This prevents other jobs from being processed even when runner capacity should be available.
Environment
- GitLab Runner version: 18.1.1
- Fleeting plugin version: v1.1.0 (regression from v1.0.0)
- Operating System: Linux (fleeting plugin host)
- Runner types: Both Windows and Linux runners
- Infrastructure: GCP
Expected Behavior (v1.0.0)
When runner concurrency limits are reached, new jobs should wait in the queue until capacity becomes available, and are then processed normally without timeouts.
Actual Behavior (v1.1.0)
Runners attempt to pick up jobs but fail with an "unable to acquire instance within the configured timeout of 15m0s" error. This:
- Ties up concurrency slots with failed acquisition attempts
- Causes jobs to time out instead of waiting
- Creates a deadlock where no jobs can proceed despite available capacity
- Requires manual intervention to clear the queue
Error Message
Preparing the "docker-autoscaler" executor
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Job failed (system failure): unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Steps to Reproduce
- Configure fleeting plugin v1.1.0 with a limited concurrent runner capacity (a configuration sketch follows this list)
- Submit multiple jobs that exceed the concurrency limit
- Observe jobs failing with timeout errors instead of waiting in queue
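For illustration, here is a minimal config.toml sketch of the kind of setup that hits this (not our exact production config; the project, zone, and instance group names are placeholders, and the plugin_config keys assume the fleeting-plugin-googlecompute README):

```toml
# Illustrative docker-autoscaler config with a small instance cap, so job
# slots outnumber instance capacity and extra jobs must wait for an instance.
concurrent = 4

[[runners]]
  name = "gcp-autoscaler"
  url = "https://gitlab.example.com"   # placeholder
  token = "REDACTED"
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"

  [runners.autoscaler]
    plugin = "fleeting-plugin-googlecompute"  # v1.1.0 reproduces the issue; v1.0.0 does not
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 2                         # lower than `concurrent`, so jobs queue for instances

    [runners.autoscaler.plugin_config]
      name    = "my-instance-group"           # placeholder instance group
      project = "my-gcp-project"              # placeholder
      zone    = "europe-west1-b"              # placeholder

    [runners.autoscaler.connector_config]
      username          = "runner"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time  = "20m0s"
```

Under v1.0.0 the jobs beyond instance capacity simply wait; under v1.1.0 they fail after the 15m acquisition timeout.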
Impact
- Production CI/CD pipeline completely blocked
- Jobs fail instead of waiting for available capacity
- Manual intervention required to restore service
Workaround
Downgrading to fleeting plugin v1.0.0 immediately resolves the issue: jobs wait in the queue as expected and are processed normally. One way to pin the plugin version is sketched below.
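This sketch assumes the OCI-distributed plugin flow; if the plugin binary was installed manually, replace the binary on the host instead:

```toml
# Pin the GCP plugin to v1.0.0 in config.toml, then run
# `gitlab-runner fleeting install` and restart gitlab-runner.
[runners.autoscaler]
  plugin = "googlecompute:1.0.0"  # an unpinned "googlecompute" resolves to the latest release
```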
Additional Context
- Used fleeting plugin since v0.0.1
- No configuration changes made between v1.0.0 and v1.1.0
- Issue observed on Linux runners (no reports of Windows runner failures so far)
- Problem occurs specifically under high job volume with concurrency limits
At the time I was more focused on restoring production than on capturing logs. I'll try to update this issue with better logs after further investigation.