Instance executor's Acquire call is made in the wrong context
There is a problem with our recently introduced instance executor: it may cause acquisition of the VM to happen in the context of an already started job.
As this executor is our path toward replacing the Docker Machine executor, we should consider this a regression. With Docker Machine and an `IdleCount > 0` setting (which is the case for our SaaS Linux runners), a job is requested only when there is a free VM that can handle it; a VM is never requested in the context of a running job.
However, while recently running some stress tests to check how metrics collection is working, I found that a few jobs failed with an error like this:
```
Running with gitlab-runner 15.5.0~beta.95.g5e7107ef (5e7107ef)
on fleeting-gcp-runner yA8Fs_aj
Preparing the "shell+autoscaler" executor
ERROR: Preparation failed: unable to acquire instance: context deadline exceeded
Will be retried in 3s ...
ERROR: Preparation failed: no acqusition ref data
Will be retried in 3s ...
ERROR: Preparation failed: no acqusition ref data
Will be retried in 3s ...
ERROR: Job failed (system failure): no acqusition ref data
```
This means that:

- The `autoscaler` provider received the `Acquire()` call, checked the capacity, and reported back that "a job can be taken".
- Runner requested the job.
- The job was received, marked as `running` in GitLab, and execution started.
- When execution reached the point where the job preparation steps should be taken, the `autoscaler` executor called Taskscaler's `Acquire()` method. At that moment Taskscaler was no longer able to provide a place for task execution, so the call failed. It was retried a few times within a few seconds, but since provisioning a new instance was either not possible due to autoscaling limits or completed just a few seconds too late, the job failed (see the sketch after this list).
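To make the ordering concrete, here is a minimal sketch of that flow. All names below are illustrative stand-ins, not the actual gitlab-runner or Taskscaler API:

```go
// Illustrative sketch only - not the real gitlab-runner/Taskscaler types.
package main

import (
	"errors"
	"fmt"
)

type scaler struct{ capacity int }

// The provider currently relies on a cheap capacity check like this one.
func (s *scaler) canSchedule() bool { return s.capacity > 0 }

// The real reservation happens later and can fail even though the
// earlier check succeeded, because capacity changes under load.
func (s *scaler) acquire() error {
	if s.capacity == 0 {
		return errors.New("no capacity for task execution")
	}
	s.capacity--
	return nil
}

func main() {
	ts := &scaler{capacity: 1}

	// 1. Provider's Acquire(): capacity looks fine, so the job is requested.
	fmt.Println("provider: can schedule:", ts.canSchedule())

	// 2. In the meantime another job takes the last slot.
	_ = ts.acquire()

	// 3. Executor's Prepare(): the actual acquisition happens only now,
	//    with the job already marked as running in GitLab - and it fails.
	if err := ts.acquire(); err != nil {
		fmt.Println("executor: Preparation failed:", err)
	}
}
```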
With our Docker Machine executor and our SaaS Linux runners this would never happen. If we reached a state with no Idle VMs ready for users' jobs, the job would simply never be requested. The worst-case scenario would be a number of jobs stuck in the queue (which also triggers alerting for us), but they would never be executed only to fail early with an error that is entirely non-actionable for the user.
IMHO Taskscaler's `Acquire()` is called in the wrong place. It should be called at the Executor Provider level, in `executors/internal/autoscaler.provider`. A call made there should "book" the task acquisition for a job and return it as the `common.ExecutorData`. Then `executors/internal/autoscaler.executor` and `executors/instance.executor` should use it as-is (probably with an additional call to Taskscaler marking the acquisition as "in-use", for observability reasons).
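A rough sketch of what I have in mind. Only `common.ExecutorData` (an empty interface in the `common` package) is real here; every other name is a hypothetical stand-in:

```go
// Hypothetical sketch of the proposed ordering - the type and method
// names are stand-ins, not the real gitlab-runner implementation.
package main

import (
	"context"
	"fmt"
)

type acquisitionRef struct{ id string }

type scaler struct{ capacity int }

func (s *scaler) acquire(ctx context.Context) (*acquisitionRef, error) {
	if s.capacity == 0 {
		return nil, fmt.Errorf("no capacity for task execution")
	}
	s.capacity--
	return &acquisitionRef{id: "slot-1"}, nil
}

// markInUse would only flag the acquisition, for observability purposes.
func (s *scaler) markInUse(ref *acquisitionRef) {}

// Provider.Acquire books the slot *before* the job is requested and
// returns it in the role of common.ExecutorData.
type provider struct{ ts *scaler }

func (p *provider) Acquire(ctx context.Context) (interface{}, error) {
	ref, err := p.ts.acquire(ctx)
	if err != nil {
		// No free slot: the job is simply never requested,
		// so nothing fails in a user-visible way.
		return nil, err
	}
	return ref, nil
}

// Executor.Prepare only consumes the pre-booked acquisition "as-is".
type executor struct{ ts *scaler }

func (e *executor) Prepare(data interface{}) {
	ref := data.(*acquisitionRef)
	e.ts.markInUse(ref)
	fmt.Println("executor: running job on pre-booked slot", ref.id)
}

func main() {
	ts := &scaler{capacity: 1}
	p := &provider{ts: ts}
	e := &executor{ts: ts}

	data, err := p.Acquire(context.Background())
	if err != nil {
		fmt.Println("provider: no capacity, job not requested:", err)
		return
	}
	// Only now would Runner request the job from GitLab.
	e.Prepare(data)
}
```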
The current implementation creates a race condition. The capacity counters used in the Executor Provider's `Acquire()` call are constantly changing with the current load. With a higher volume of jobs entering and leaving the runner process, the numbers seen in the Executor Provider's `Acquire()` call may no longer be true only a few seconds later, when the Executor's `Prepare()` calls Taskscaler's `Acquire()`.
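This is a classic check-then-act gap. The generic demonstration below (not runner code) may or may not trip on any given run, which is exactly what makes the bug hard to see at low volume:

```go
// Generic check-then-act demonstration, unrelated to the runner codebase.
// Two workers both see one free slot, both proceed, and one of them may
// find the slot gone when it finally acquires. The outcome is
// timing-dependent, so a single run may or may not reproduce it.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var free int64 = 1 // one slot left
	var failedLate int64
	var wg sync.WaitGroup

	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// "Provider" step: just check the counter.
			if atomic.LoadInt64(&free) > 0 {
				// ...the job is requested and started here...
				// "Executor" step: acquire for real, possibly too late.
				if atomic.AddInt64(&free, -1) < 0 {
					atomic.AddInt64(&failedLate, 1)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Printf("jobs that failed after already starting: %d\n", failedLate)
}
```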
In my tests I was running around 20-30 jobs concurrently and this happened multiple times. With our SaaS runners handling up to a thousand jobs at once, I expect it to happen even more often.