Backoff scaling based on ready-up error rate
Taskscaler allows a "ready up" function to be provided, which is executed to ensure that an instance is ready to receive jobs. If this function returns an error, the instance is ultimately deleted.
GitLab-Runner makes limited use of this function already:
- If the autoscaler doesn't use nesting, it returns an error if the instance was pre-existing (because we can't trust the existing state of the instance; we'll later improve this by tracking an instance use count elsewhere).
- If the autoscaler does use nesting, it checks whether it can connect to the nesting daemon.
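As a rough illustration of those two cases, here is a minimal Go sketch. It is not the GitLab-Runner implementation: the `instance` type, the `readyUp` signature, and the `dialNesting` callback are all hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errPreExisting is a hypothetical sentinel error for this sketch.
var errPreExisting = errors.New("instance is pre-existing, state cannot be trusted")

// instance is a hypothetical stand-in for whatever the autoscaler
// knows about a provisioned instance.
type instance struct {
	PreExisting bool
}

// readyUp mirrors the two cases described above:
//   - without nesting, a pre-existing instance is rejected;
//   - with nesting, the nesting daemon must be reachable.
//
// dialNesting is an assumed callback that attempts the connection.
func readyUp(inst instance, useNesting bool, dialNesting func() error) error {
	if !useNesting {
		if inst.PreExisting {
			return errPreExisting
		}
		return nil
	}
	if err := dialNesting(); err != nil {
		return fmt.Errorf("connecting to nesting daemon: %w", err)
	}
	return nil
}
```

Any non-nil return here corresponds to the "instance is ultimately deleted" outcome described above.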
I have an MR open at the moment which will add the config option instance_ready_command to GitLab-Runner, which can run a user-provided command to ensure that the instance is ready. This is helpful for cases where the SSH server has started, but a utility like cloud-init hasn't yet completed.
Upon testing instance_ready_command, however, I remembered that a failure in this function can cause a lot of instance churn. For example, if you have an idle_count of 10 and an instance_ready_command of exit 1, you'll constantly create 10 instances and immediately remove them at min(1 second, as fast as the cloud provider API permits).
We were working on an instance health check (#9), but I think that is more applicable to detecting an instance that has already been marked as "ready" and that we slowly discover is no longer viable. It's very instance-specific, rather than a scaling-up concern in general.
We need to back off scaling based on the ready-up error rate.
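One possible shape for this, sketched in Go under stated assumptions: keep a sliding window of recent ready-up outcomes and, once the failure rate crosses a threshold, delay the next scale-up in proportion to that rate. None of this exists in taskscaler today; `readyUpBackoff`, its fields, and the linear delay formula are hypothetical choices for illustration only.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errNotReady is a sample ready-up failure used only for illustration.
var errNotReady = errors.New("instance not ready")

// readyUpBackoff is a hypothetical helper (not part of taskscaler) that
// tracks the most recent ready-up outcomes and derives a scale-up delay.
type readyUpBackoff struct {
	mu        sync.Mutex
	window    []bool        // true means the ready-up call failed
	size      int           // maximum number of outcomes kept
	threshold float64       // error rate below which no delay is applied
	max       time.Duration // delay applied at a 100% error rate
}

func newReadyUpBackoff(size int, threshold float64, max time.Duration) *readyUpBackoff {
	return &readyUpBackoff{size: size, threshold: threshold, max: max}
}

// Record stores the outcome of one ready-up call, evicting the oldest
// outcome once the window is full.
func (b *readyUpBackoff) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.window = append(b.window, err != nil)
	if len(b.window) > b.size {
		b.window = b.window[1:]
	}
}

// ErrorRate returns the fraction of failures in the window (0 if empty).
func (b *readyUpBackoff) ErrorRate() float64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.window) == 0 {
		return 0
	}
	failed := 0
	for _, f := range b.window {
		if f {
			failed++
		}
	}
	return float64(failed) / float64(len(b.window))
}

// Delay returns how long to wait before the next scale-up attempt:
// zero below the threshold, otherwise the error rate times max.
func (b *readyUpBackoff) Delay() time.Duration {
	rate := b.ErrorRate()
	if rate < b.threshold {
		return 0
	}
	return time.Duration(rate * float64(b.max))
}
```

With the idle_count of 10 and exit 1 example above, every ready-up call would be recorded as a failure, the error rate would reach 100%, and instance creation would be throttled to one attempt per `max` interval instead of churning continuously. A linear delay is only one option; an exponential backoff on consecutive failures would be an equally reasonable design.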