Gracefully Handle API failures
Today we had a major outage of our GCE runners due to overloaded runner-managers. It appears to have started around 8AM UTC on 2017-01-28. The machines began having difficulty creating and deleting GCE instances; the cause is not yet known. We had disabled the other shared-runners to see how the GCE runners fared on their own, so builds simply stopped working.
The gitlab-runner should handle errors like this more gracefully.
My suggestion is that if the GCE API returns a 404 for an instance, the runner should either stop trying to delete that instance or force-delete it locally to clear it out. With so many failed machines, we began to absolutely hammer the GCE API with retries, which made everything worse, and builds stopped processing. This has left us with many stale builds stuck in the running state, because the runners still tried to pick up new builds.
Related Issue: https://gitlab.com/gitlab-com/infrastructure/issues/1079