Investigate and possibly improve response to GCP API rate limiting
We (gitlab.com operators) observed throughput (in terms of VM creation) drop substantially when a GCP error resulted in our API usage quota being cut to 2k reads per 100 seconds. At the time of the cut we were using 10k-12k reads per 100 seconds. Theoretically, such throttling should have left us with no less than roughly 20% of our previous throughput (2k of 10k), but we observed even worse performance. This chart shows the number of running GitLab CI jobs over time during the described incident:
The large dip marks the beginning of the throttling incident. The number of running jobs fell to less than 10% of its pre-incident value, which is roughly 5% of today's typical peak throughput. Admittedly these are not direct docker-machine metrics, but they should be a close proxy for them.
Does docker-machine follow GCP's best practices for rate limiting in its GCE driver, including truncated exponential backoff? If not, that could explain our performance issues: retrying rapidly while over the rate limit burns quota on failed requests and could seriously hamper its ability to ever get a machine provisioned. A sketch of the recommended retry pattern is below.
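For reference, here is a minimal Go sketch of the kind of truncated exponential backoff with jitter that GCP recommends. This is not docker-machine's actual code; `retryWithBackoff` is a hypothetical helper, and the 403/429 status checks assume the rate-limit errors surface as `googleapi.Error` values as they do when using the official Google API client:

```go
package main

import (
	"errors"
	"math/rand"
	"net/http"
	"time"

	"google.golang.org/api/googleapi"
)

// retryWithBackoff retries op when the GCP API reports rate limiting
// (403 rateLimitExceeded or 429), doubling the wait after each attempt
// up to maxBackoff and adding random jitter to avoid thundering herds.
// Hypothetical helper for illustration only.
func retryWithBackoff(op func() error, maxRetries int, maxBackoff time.Duration) error {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil {
			return nil
		}

		// Only retry on rate-limit responses; fail fast on anything else.
		var gerr *googleapi.Error
		if !errors.As(err, &gerr) ||
			(gerr.Code != http.StatusForbidden && gerr.Code != http.StatusTooManyRequests) {
			return err
		}
		if attempt >= maxRetries {
			return err
		}

		// Sleep for the current backoff plus up to 1s of jitter,
		// then double the backoff, capped at maxBackoff.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(time.Second))))
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

A caller would wrap each individual compute API request, e.g. `retryWithBackoff(func() error { _, err := instancesService.Get(project, zone, name).Do(); return err }, 8, 64*time.Second)`. The point of the investigation is to confirm whether the GCE driver does something equivalent, or simply fails (or retries immediately) on a throttled response.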
It's possible that GitLab CI's orchestration around docker-machine was partially responsible, but it probably makes sense to first verify that docker-machine itself is backing off correctly here.
RFC @tmaczukin