[v1.1.0] Runner fails to acquire instances causing job queue deadlock - regression from v1.0.0
Problem Description
After upgrading the fleeting plugin from v1.0.0 to v1.1.0, runners pick up jobs but then fail to acquire an instance, hitting timeout errors and deadlocking the job queue. This prevents other jobs from being processed even when runner capacity should be available.
Environment
- GitLab Runner version: 18.1.1
- Fleeting plugin version: v1.1.0 (regression from v1.0.0)
- Operating System: Linux (fleeting plugin host)
- Runner types: Both Windows and Linux runners
- Infrastructure: GCP
Expected Behavior (v1.0.0)
When runner concurrency limits are reached, new jobs should wait in the queue until capacity becomes available, and are then processed normally without timeouts.
Actual Behavior (v1.1.0)
Runners attempt to pick up jobs but fail with an "unable to acquire instance within the configured timeout of 15m0s" error. This:
- Ties up concurrency slots with failed acquisition attempts
- Causes jobs to time out instead of waiting
- Creates a deadlock where no jobs can proceed despite available capacity
- Requires manual intervention to clear the queue
Error Message
Preparing the "docker-autoscaler" executor
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Will be retried in 3s ...
ERROR: Job failed (system failure): unable to acquire instance within the configured timeout of 15m0s: context deadline exceeded
Steps to Reproduce
- Configure fleeting plugin v1.1.0 with a limited concurrent runner capacity (a configuration sketch follows this list)
- Submit multiple jobs that exceed the concurrency limit
- Observe jobs failing with timeout errors instead of waiting in queue
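For illustration, here is a minimal config.toml sketch of the kind of setup that hits this (not our exact production config; the project, zone, and instance group names are placeholders, and the plugin_config keys assume the fleeting-plugin-googlecompute README):

```toml
# Illustrative docker-autoscaler config with a small instance cap, so job
# slots outnumber instance capacity and extra jobs must wait for an instance.
concurrent = 4

[[runners]]
  name = "gcp-autoscaler"
  url = "https://gitlab.example.com"   # placeholder
  token = "REDACTED"
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"

  [runners.autoscaler]
    plugin = "fleeting-plugin-googlecompute"  # v1.1.0 reproduces the issue; v1.0.0 does not
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 2                         # lower than `concurrent`, so jobs queue for instances

    [runners.autoscaler.plugin_config]
      name    = "my-instance-group"           # placeholder instance group
      project = "my-gcp-project"              # placeholder
      zone    = "europe-west1-b"              # placeholder

    [runners.autoscaler.connector_config]
      username          = "runner"
      use_external_addr = true

    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time  = "20m0s"
```

Under v1.0.0 the jobs beyond instance capacity simply wait; under v1.1.0 they fail after the 15m acquisition timeout.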
Impact
- Production CI/CD pipeline completely blocked
- Jobs fail instead of waiting for available capacity
- Manual intervention required to restore service
Workaround
Downgrading to fleeting plugin v1.0.0 immediately resolves the issue: jobs wait in the queue as expected and are processed normally. One way to pin the plugin version is sketched below.
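This sketch assumes the OCI-distributed plugin flow; if the plugin binary was installed manually, replace the binary on the host instead:

```toml
# Pin the GCP plugin to v1.0.0 in config.toml, then run
# `gitlab-runner fleeting install` and restart gitlab-runner.
[runners.autoscaler]
  plugin = "googlecompute:1.0.0"  # an unpinned "googlecompute" resolves to the latest release
```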
Additional Context
- Used fleeting plugin since v0.0.1
- No configuration changes made between v1.0.0 and v1.1.0
- Issue observed on Linux runners (no reports of Windows runner failures so far)
- Problem occurs specifically under high job volume with concurrency limits
At the time I was more focused on restoring production than on capturing logs. I'll try to update this issue with better logs after further investigation.