Don't block Runner from picking new machines when deletion is stuck
Overview
In a recent production incident, the Runner stopped picking up new jobs because the docker+machine executor was not able to delete the VM that it had created for a job. If a machine can't be removed, we don't consider the job to be done: we wait until the machine is deleted, and only then is the job slot freed up for a new job.
When a machine gets stuckOnRemove, we should not block the job slot. Instead, we should free it up so a new job can be picked up, and the Runner should retry deleting the machine after some time. That way, if the provider is having problems deleting machines, we don't end up blocked by them and still pick up new jobs. Because the Runner keeps retrying, the machines are deleted automatically once the provider fixes the issue on their end.
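To make the proposal concrete, here is a minimal sketch of the idea in Go. It is not the Runner's actual code: `Provider`, `Pool`, `StartJob`, `FinishJob`, `RetryStuck` and `flakyProvider` are all hypothetical names made up for illustration. The point is only that the slot is freed whether or not the removal succeeds, and that failed removals are remembered and retried later.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Provider is a hypothetical stand-in for whatever deletes the VM.
type Provider interface {
	Remove(name string) error
}

// Pool tracks free job slots and the machines whose removal failed.
type Pool struct {
	mu        sync.Mutex
	freeSlots int
	stuck     map[string]struct{}
	provider  Provider
}

func NewPool(slots int, p Provider) *Pool {
	return &Pool{freeSlots: slots, stuck: make(map[string]struct{}), provider: p}
}

// StartJob takes a slot if one is free and reports whether it succeeded.
func (p *Pool) StartJob() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.freeSlots == 0 {
		return false
	}
	p.freeSlots--
	return true
}

// FinishJob tries to remove the machine once. The slot is freed either
// way; a machine that could not be removed is recorded for a later retry.
func (p *Pool) FinishJob(machine string) {
	err := p.provider.Remove(machine)
	p.mu.Lock()
	defer p.mu.Unlock()
	if err != nil {
		p.stuck[machine] = struct{}{}
	}
	p.freeSlots++
}

// RetryStuck attempts one more removal for every stuck machine and
// returns how many are still stuck afterwards.
func (p *Pool) RetryStuck() int {
	p.mu.Lock()
	names := make([]string, 0, len(p.stuck))
	for name := range p.stuck {
		names = append(names, name)
	}
	p.mu.Unlock()

	// Call the provider without holding the lock so slow removals
	// do not block jobs from starting or finishing.
	for _, name := range names {
		if err := p.provider.Remove(name); err == nil {
			p.mu.Lock()
			delete(p.stuck, name)
			p.mu.Unlock()
		}
	}

	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.stuck)
}

// flakyProvider fails every removal while down is true.
type flakyProvider struct{ down bool }

func (f *flakyProvider) Remove(name string) error {
	if f.down {
		return errors.New("provider unavailable")
	}
	return nil
}

func main() {
	prov := &flakyProvider{down: true}
	pool := NewPool(1, prov)

	pool.StartJob()
	pool.FinishJob("machine-1") // removal fails, slot is freed anyway
	fmt.Println("slot available for next job:", pool.StartJob())

	prov.down = false // provider recovers
	fmt.Println("machines still stuck after retry:", pool.RetryStuck())
}
```

In a real implementation `RetryStuck` would presumably be driven by a timer or a background goroutine; the sketch leaves the scheduling out on purpose.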
Things to think about
- How often should we try to delete stuck machines?
- This could blow up the Runner's memory usage if we keep accepting jobs and end up with more and more machines to keep track of.