Improve alerting for CI runner outages

Some ideas:

  • Alert on the rate of failed DO creates
  • Alert if the number of running jobs is low, especially if the number of pending jobs is high, compared on a per-type basis (so GCE jobs won’t obscure DO failures)
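
A rough sketch of the second idea, assuming we can get running and pending job counts broken down by runner type. The type names, thresholds, and the `RunnerTypeStats` / `starved_types` helpers are all hypothetical and only illustrate the per-type comparison:

```python
from dataclasses import dataclass


@dataclass
class RunnerTypeStats:
    running: int  # jobs currently running on this runner type
    pending: int  # jobs waiting for a runner of this type


def starved_types(stats, min_running=5, max_pending=20):
    """Return runner types with few running jobs while many are pending.

    Each type is evaluated on its own, so a healthy type (e.g. GCE)
    cannot mask a failing one (e.g. DO). Thresholds are placeholders.
    """
    return sorted(
        name
        for name, s in stats.items()
        if s.running < min_running and s.pending > max_pending
    )


# Made-up snapshot: DO is barely running anything while its queue grows.
snapshot = {
    "do": RunnerTypeStats(running=1, pending=140),
    "gce": RunnerTypeStats(running=80, pending=12),
}
print(starved_types(snapshot))
```

Summing the counts across types would hide the problem here: 81 running jobs overall looks fine even though DO is effectively down.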

In general, we should be able to alert when our provider is having a failure, so we can react more quickly by enabling the secondary (more expensive) shared runners.

@tmaczukin Do we have enough metrics to alert on the rate of failures vs. the rate of successes? (We already do this for the web front end, and it works really well.)
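
For illustration, the failures-vs-successes check could look roughly like this, assuming we have counts of failed and successful machine creates over some recent window. The function names, the 50% threshold, and the minimum-attempts guard are assumptions, not existing metrics or rules:

```python
def failure_ratio(failed, succeeded):
    """Fraction of attempts that failed; 0.0 when there were no attempts."""
    total = failed + succeeded
    return failed / total if total else 0.0


def should_alert(failed, succeeded, threshold=0.5, min_attempts=10):
    """Alert when most recent attempts failed.

    min_attempts avoids firing on a handful of failures during quiet
    periods. Both default values are placeholders to be tuned.
    """
    enough_data = (failed + succeeded) >= min_attempts
    return enough_data and failure_ratio(failed, succeeded) > threshold
```

Comparing failures against successes, rather than alerting on the raw failure count, keeps the alert meaningful as overall job volume goes up and down.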
