2018-08-02 Shared Runners in GCP problems

Today we're observing very strange behavior in half of our Shared Runners fleet in GCP. It seems that there is a network-related problem in the us-east1-c zone.

The first type of problem is with machine creation - there are regular drops in the number of created machines. Also, even when the Runner is able to create machines, their number is much smaller than in us-east1-d. For comparison, a few graphs:

us-east1-c

us-east1-d

Looking into the logs, I can see that most of the machine creation failures are caused by various timeouts received by Docker Machine.

Also, checking the jobs that were eventually started on Runners based in us-east1-c, I can see that most of them hang and finally fail on various networking operations (in most cases - on pulling the images for the job container and defined services).

It clearly looks like a networking problem in GCP us-east1-c. However - at least for now - the GCP status dashboard shows that all services are operating normally.

Live graphs can be previewed at:

Edited Aug 02, 2018 by Tomasz Maczukin