AWS rate limit prevents using already-started Docker Machine instances

Summary

Follow-up from #3296 (closed).

The fix implemented in !909 (merged) adds caching for machine statuses, which prevents a rate limit scenario most of the time. However, the cache is bypassed when a machine is picked up for use, so if AWS is already rate limiting API calls for some other reason, existing machines still cannot be used. A couple of days ago we observed a case where many instances were available across our GitLab Runner instances, but only about 10% of them were being used.

I attribute this to CanConnect running docker-machine config, which makes an AWS DescribeInstances call. If that call fails, the machine is marked as unreachable and a new one is created, even if the existing machine could still be reached at its last-known IP. Creating the new machine also fails, since AWS is rejecting most API calls because of the rate limit.
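
The failure mode described above can be modelled in a few lines of Go. This is only an illustrative sketch, not the actual GitLab Runner code: `configFunc`, `canConnect`, `usable` and `errRateLimited` are hypothetical names, and the config call is replaced by a stub so the example runs without AWS.

```go
package main

import (
	"errors"
	"fmt"
)

// errRateLimited is a hypothetical stand-in for the throttling error
// returned by the cloud provider.
var errRateLimited = errors.New("RequestLimitExceeded")

// configFunc is a hypothetical stand-in for running `docker-machine config`,
// which depends on a successful DescribeInstances call.
type configFunc func(name string) error

// canConnect models the behavior described in this issue: any failure of
// the config call is treated as "machine unreachable".
func canConnect(name string, config configFunc) bool {
	return config(name) == nil
}

// usable counts the machines that would still be handed to jobs.
func usable(machines []string, config configFunc) int {
	n := 0
	for _, m := range machines {
		if canConnect(m, config) {
			n++
		}
	}
	return n
}

func main() {
	machines := []string{"runner-a", "runner-b", "runner-c"}
	healthy := func(string) error { return nil }
	throttled := func(string) error { return errRateLimited }

	fmt.Println(usable(machines, healthy))   // 3
	fmt.Println(usable(machines, throttled)) // 0
}
```

In this model, a throttled provider makes every running machine look unreachable, even though nothing about the machines themselves has changed.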

This situation will not self-resolve. Failing requests use up more of the API quota than successful ones. The only solution is to stop all running GitLab Runner instances, wait about 30 minutes, then resume.

Steps to reproduce

  1. Start any number of GitLab Runners.
  2. Ensure a healthy number of concurrent jobs (maybe start with ~100).
  3. Trigger a rate limit scenario: either increase the number of concurrent jobs to ~300, or call DescribeInstances repeatedly (a rate of about 50k requests per hour should trigger this).

Actual behavior

Observe that the gitlab_runner_autoscaling_machine_states metric shows the majority of instances in the acquired and creating states, and only about 10-15% of instances in the used state.

Expected behavior

The GitLab Runner should be able to utilise the existing pool of machines. In a rate limit scenario it would be acceptable for the pool of machines to be unable to grow. It is not acceptable that the existing machines cannot be used.

Possible Fixes / PoC

Fixes landing in 12.5 to remediate this issue