
AWS rate limit prevents using already-started Docker Machine instances

Summary

Follow-up from #3296 (closed).

The fix in !909 (merged) adds caching for machine statuses, which prevents getting into a rate-limit scenario most of the time. However, the cache is bypassed when a machine is about to be used, so existing machines still cannot be used once AWS is already rate limiting the account for other reasons. A couple of days ago we observed many available instances running across GitLab Runner instances, but only about 10% of them were being used.

I attribute this to the fact that CanConnect runs docker-machine config, which makes an AWS DescribeInstances call. If that call fails, the machine is marked as unreachable and a new one is created, even though the machine could still be reached at its last-known IP. Creating the new machine will also fail, since AWS is rejecting most API calls because of the rate limit.
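
As a rough illustration of what a fallback could look like (a sketch only, not GitLab Runner's actual code; the names `machine`, `LastKnownIP` and `canConnect` here are made up), the connectivity check could probe the Docker daemon on the machine's last-known IP before declaring the machine unreachable:

```go
// Sketch: fall back to a direct TCP probe of the Docker daemon on the
// machine's last-known IP when `docker-machine config` (and therefore
// DescribeInstances) fails, instead of immediately marking the machine
// unreachable. All names here are hypothetical.
package main

import (
	"fmt"
	"net"
	"time"
)

const dockerDaemonPort = "2376" // default TLS port of the Docker daemon

type machine struct {
	Name        string
	LastKnownIP string
}

// canConnect first tries the expensive check (docker-machine config, which
// calls DescribeInstances). If that errors, it probes the last-known IP
// directly before giving up on the machine.
func canConnect(m machine, configCheck func(name string) error) bool {
	if err := configCheck(m.Name); err == nil {
		return true
	}
	if m.LastKnownIP == "" {
		return false
	}
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(m.LastKnownIP, dockerDaemonPort), 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	m := machine{Name: "runner-abc123", LastKnownIP: "10.0.12.34"}
	// Simulate DescribeInstances failing because of AWS rate limiting.
	failing := func(string) error { return fmt.Errorf("RequestLimitExceeded") }
	fmt.Println("reachable via last-known IP:", canConnect(m, failing))
}
```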

This situation will not self-resolve. Failing requests use up more of the API quota than successful ones. The only solution is to stop all running GitLab Runner instances, wait about 30 minutes, then resume.

Steps to reproduce

  1. Start any number of GitLab Runners.
  2. Ensure a healthy number of concurrent jobs (maybe start with ~100).
  3. Trigger a rate-limit scenario: either increase the number of concurrent jobs to ~300, or keep calling DescribeInstances at a rate of about 50k requests per hour (see the sketch after this list for a way to generate that load).
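
A minimal sketch for step 3, assuming a disposable test account, the us-east-1 region, and the aws-sdk-go v1 client (all assumptions, not part of the report): it simply calls DescribeInstances at roughly 15 requests per second (~54k/hour) until AWS starts throttling.

```go
// Sketch: hammer DescribeInstances to trigger AWS API throttling in a test
// account. Region and request rate are assumptions for illustration only.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	// ~15 requests/second is roughly 54k requests/hour, which should be
	// enough to hit the DescribeInstances throttle.
	ticker := time.NewTicker(time.Second / 15)
	defer ticker.Stop()

	for range ticker.C {
		if _, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{}); err != nil {
			fmt.Println("DescribeInstances error:", err)
		}
	}
}
```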

Actual behavior

Observe that the gitlab_runner_autoscaling_machine_states metric shows the majority of the instances in the acquired and creating states, and only about 10-15% of instances in the used state.

Expected behavior

The GitLab Runner should be able to utilise the existing pool of machines. In a rate-limit scenario it would be acceptable for the pool of machines not to grow; however, it is not acceptable that the existing machines cannot be used.

Possible Fixes / PoC

Fixes landing in 12.5 to remediate this issue
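
As a rough sketch of the kind of remediation involved (not the actual changes landing in 12.5; the names `statusCache`, `lookup` and `isThrottled` are hypothetical), the status cache could keep serving the last-known state when AWS returns a throttling error, instead of marking the machine unreachable and creating a new one:

```go
// Sketch of one possible remediation: on an AWS throttling error, return the
// cached machine state rather than treating the machine as gone.
package main

import (
	"fmt"
	"strings"
	"sync"
)

type statusCache struct {
	mu     sync.Mutex
	states map[string]string // machine name -> last known state
}

func isThrottled(err error) bool {
	// AWS reports RequestLimitExceeded when the API rate limit is hit.
	return err != nil && strings.Contains(err.Error(), "RequestLimitExceeded")
}

// Status asks AWS via lookup; on a throttling error it falls back to the
// cached value so the machine can still be used.
func (c *statusCache) Status(name string, lookup func(string) (string, error)) (string, error) {
	state, err := lookup(name)
	if err == nil {
		c.mu.Lock()
		c.states[name] = state
		c.mu.Unlock()
		return state, nil
	}
	if isThrottled(err) {
		c.mu.Lock()
		cached, ok := c.states[name]
		c.mu.Unlock()
		if ok {
			return cached, nil // stale but usable; do not create a new machine
		}
	}
	return "", err
}

func main() {
	c := &statusCache{states: map[string]string{"runner-abc123": "Running"}}
	throttled := func(string) (string, error) { return "", fmt.Errorf("RequestLimitExceeded") }
	fmt.Println(c.Status("runner-abc123", throttled))
}
```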
