Improve alerting for CI runner outages
Some ideas:
- Alert on the rate of failed DO creates
- Alert if the number of running jobs is low, especially if the number of pending jobs is high, compared on a per-type basis (so GCE jobs won’t obscure DO failures)
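To make the per-type comparison concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `JobCounts` type, the `starved_runner_types` function, and the threshold values are hypothetical and not taken from our current metrics or alerting setup.

```python
from dataclasses import dataclass


@dataclass
class JobCounts:
    """Job counts for one runner type (hypothetical structure)."""
    running: int
    pending: int


def starved_runner_types(counts, min_running=5, max_pending_ratio=2.0):
    """Return runner types with few running jobs and a large pending backlog.

    Thresholds are placeholder assumptions, not tuned values.
    """
    alerts = []
    for runner_type, c in sorted(counts.items()):
        low_running = c.running < min_running
        # Use max(running, 1) so a type with zero running jobs still compares sanely.
        backlog = c.pending > max(c.running, 1) * max_pending_ratio
        if low_running and backlog:
            alerts.append(runner_type)
    return alerts


counts = {
    "do": JobCounts(running=1, pending=40),
    "gce": JobCounts(running=50, pending=10),
}
print(starved_runner_types(counts))
```

In this example the total number of running jobs (51) looks healthy, so an aggregate check would stay quiet, while the per-type check flags only `do`.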
In general, we should be able to alert when our provider is having a failure, so we can react more quickly by enabling the secondary (more expensive) shared runners.
@tmaczukin do we have enough metrics to alert on the rate of failures vs. the rate of successes? (We are doing this for the web front end and it works really well.)
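If failure and success counts are available per time window, the ratio check could look roughly like the sketch below. The function names, the `threshold` of 0.5, and the `min_attempts` floor are hypothetical values chosen for illustration; the real metric names and limits would depend on what the runners actually export.

```python
def failure_ratio(failed, succeeded):
    """Fraction of attempts that failed in a window; 0.0 when there were no attempts."""
    total = failed + succeeded
    if total == 0:
        return 0.0
    return failed / total


def should_alert_on_failures(failed, succeeded, threshold=0.5, min_attempts=10):
    """Alert only when there is enough traffic and the failure ratio is high.

    threshold and min_attempts are placeholder assumptions.
    """
    if failed + succeeded < min_attempts:
        # Too few attempts to judge; avoids alerting on a single failed create.
        return False
    return failure_ratio(failed, succeeded) >= threshold


print(should_alert_on_failures(failed=8, succeeded=2))
print(should_alert_on_failures(failed=1, succeeded=1))
```

Comparing failures against successes, rather than alerting on an absolute failure count, keeps the alert meaningful at both low and high load.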