AWS rate limit prevents using already-started Docker Machine instances
Summary
Follow-up from #3296 (closed).
The fix in !909 (merged) adds caching for machine statuses, which avoids running into a rate limit scenario most of the time. However, the cache is bypassed when a machine is about to be used, so existing machines still cannot be used once AWS is already rate limiting us for other reasons. A couple of days ago we observed many available instances running across our GitLab Runner hosts, but only about 10% of them were actually being used.
I attribute this to the fact that `CanConnect` runs `docker-machine config`, which makes an AWS `DescribeInstances` call. If that call fails, the machine is marked as unreachable and a new one is created, even though the machine could still be reached at its last-known IP. Creating the new machine will also fail, since AWS is rejecting most API calls because of the rate limit.
This situation does not self-resolve: failing requests consume more of the API quota than successful ones. The only remedy we found is to stop all running GitLab Runner instances, wait about 30 minutes, then resume.
Steps to reproduce
- Start any number of GitLab Runners.
- Ensure a healthy number of concurrent jobs (maybe start with ~100)
- Trigger a rate limit scenario: either increase the number of concurrent jobs to ~300, or keep calling `DescribeInstances` (a rate of about 50k requests per hour should trigger this)
Actual behavior
Observe that the `gitlab_runner_autoscaling_machine_states` metric shows the majority of instances in the `acquired` and `creating` states, and only about 10-15% of instances in the `used` state.
Expected behavior
The GitLab Runner should be able to utilise the existing pool of machines. In a rate limit scenario it is acceptable that the pool cannot grow; it is not acceptable that the existing machines cannot be used.
Possible Fixes / PoC
- Check docker-machine instances using Docker API
- Cap maximum Docker Machine provisioning rate
- Populate a list of machines with machines that might not yet be persisted on disk
Fixes landing in 12.5 to remediate this issue
- Reduce the number of API calls docker-machine makes, so that we avoid ever reaching the API limits in the first place
- Keep a list of machines that are not yet persisted to disk, so that even when we do reach the API limits, GitLab Runner does not create too many machines. This also reduces machine creation overall, which helps prevent reaching the API limits in the first place
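The "not yet persisted" bookkeeping from the last item can be sketched as an in-memory set that is merged with the on-disk machine list, so the Runner counts in-flight machines before deciding to create more. All names here are illustrative, not the actual 12.5 implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// pendingMachines tracks machines whose creation has started but whose
// config has not yet been written to disk by docker-machine. Counting
// them prevents the Runner from over-provisioning while the API is slow.
type pendingMachines struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newPendingMachines() *pendingMachines {
	return &pendingMachines{names: make(map[string]struct{})}
}

func (p *pendingMachines) Add(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.names[name] = struct{}{}
}

// Remove is called once the machine shows up on disk (or creation failed).
func (p *pendingMachines) Remove(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.names, name)
}

// Merge returns the on-disk machine list plus any in-flight machines
// that are not persisted yet, without duplicates.
func (p *pendingMachines) Merge(onDisk []string) []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	seen := make(map[string]struct{}, len(onDisk))
	out := append([]string(nil), onDisk...)
	for _, n := range onDisk {
		seen[n] = struct{}{}
	}
	for n := range p.names {
		if _, ok := seen[n]; !ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	p := newPendingMachines()
	p.Add("runner-machine-2")
	fmt.Println(len(p.Merge([]string{"runner-machine-1"}))) // prints 2
}
```

Sizing decisions would then run against the merged list rather than the on-disk list alone, so a burst of slow creations is visible immediately.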