Shared Runners slowness on 2017-12-18, 07:00-12:00 UTC
At 2017-12-18 ~11:30 UTC we noticed that the pending jobs queue had grown very large and received the first reports that jobs were staying in the pending
stage for much longer than usual.
After a quick investigation we found that Docker Machine, for an unknown reason, was having problems provisioning new autoscaled machines. The problem seemed to be related to how Docker Machine uses the DigitalOcean API. To resolve the issue for our users we enabled our backup Shared Runners in GCE. Jobs slowly started to be processed again, and at 2017-12-18 ~12:00 UTC the pending jobs queue dropped from 1.3K to 250 jobs.
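For context, this is a minimal sketch of how an autoscaled setup like ours is typically configured in GitLab Runner's `config.toml`, using the `docker+machine` executor with the DigitalOcean driver. All concrete values below (name, URL, token, image, region, size, idle counts) are illustrative placeholders, not our production settings:

```toml
concurrent = 100

[[runners]]
  name = "shared-runner-autoscale"          # placeholder name
  url = "https://gitlab.com/"
  token = "RUNNER_TOKEN"                    # placeholder token
  executor = "docker+machine"               # GitLab Runner drives Docker Machine to create VMs
  [runners.docker]
    image = "ruby:2.4"                      # default job image, illustrative
  [runners.machine]
    IdleCount = 5                           # machines kept warm for incoming jobs
    IdleTime = 600                          # seconds before an idle machine is removed
    MaxBuilds = 100                         # builds per machine before it is recycled
    MachineDriver = "digitalocean"          # provisioning goes through the DigitalOcean API
    MachineName = "auto-scale-%s"
    MachineOptions = [
      "digitalocean-access-token=DO_API_TOKEN",   # placeholder
      "digitalocean-image=coreos-stable",
      "digitalocean-region=nyc1",
      "digitalocean-size=2gb",
    ]
```

With this kind of configuration, every new autoscaled machine is created by Docker Machine calling the DigitalOcean API, which is why a problem on that path immediately translates into jobs piling up in the pending queue.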
After this we continued looking for the root cause of the problem, without success. After asking the cloud provider's support for help, and providing all the data we were able to get from our logs, we were assured that everything looked fine on their side. At the same time, the only explanation our logs pointed to was that the problem was caused by the cloud API. Together we concluded that it must have been some strange edge case in how Docker Machine handles DigitalOcean's API.
Unfortunately, the problem disappeared before we were able to capture verbose logs of Docker Machine's output.
At 2017-12-18 ~15:00 UTC machine provisioning returned to normal. A few hours later we disabled the GCE Shared Runners.