Create VMs in background to speed up the autoscaling
For the initial implementation of `autoscaler` we've decided to go with the simplest approach: create the VM in the context of the `prepare` command call. While this allowed us to move forward and prepare a working test environment (which in a few days will bring us an open beta tests program on GitLab.com), it's not ideal from the user's perspective.
With the current implementation each job takes longer by the time required to spin up the VM that will handle it. In our existing Shared Runners (powered by the `docker+machine` executor) this time is mostly invisible to the user, because Runner creates VMs in advance and assigns the first free one from the pool to a newly started job. The user experience difference is that with our current Windows Shared Runners configuration we should expect job queue timings similar to what we see for the Linux Shared Runners only under heavy load on GitLab.com's CI.
When we discussed this in the past, the idea we had to improve this was the following:
Implement a daemon mode in `autoscaler`. With this we would start `autoscaler` as a separate, long-living process, in exactly the same way GitLab Runner is started.
We should add configuration options similar to what we have in the `docker+machine` executor, so: number of `idle` VMs, maximum number of VMs, and maximum number of jobs that a VM can handle before being removed. For the first iteration I think we can skip supporting
Together, these three options would be responsible for creating VMs in the background. Keeping the `idle` number of machines up and ready, managing the lifetime of the VMs, and assigning free VMs when requested would be the task of the daemon mode.
When autoscaling is enabled, we should change the behavior of `autoscaler`'s commands. Instead of creating/removing the VM directly, they should:
- Check if `autoscaler`'s daemon is available.
- Connect to it and request a VM lock (for `prepare`). If there is a ready VM, `autoscaler`'s daemon should choose one, lock it for the usage of the given job and return immediately. If not, then it should schedule a creation, which of course becomes a blocking operation (so it behaves as it does now).
- Connect to it and get the VM connection details (for `run`). Execute the job on the VM as it's done now.
- Connect to it and request a VM release (for `cleanup`). Depending on the autoscaling configuration this would either release the VM back to the pool of free VMs or trigger a VM removal. Either way, `cleanup` returns immediately, and the removal/release happens in the background.
- Check if
To communicate between `autoscaler`'s commands and the `autoscaler` daemon we should use something like gRPC.
With the above we would decouple VM management from job execution as much as possible. In fact, in a slightly different way, it would replicate how we work now with our Linux Shared Runners.