[meta] Introduce new alerts for CI

This is a meta issue that tracks all the alerting needed from a CI perspective:

  • generic node_load1 alert: runbooks!402 (follow-up for #3157 (moved))
  • alerting for unusual proportions of different states and stages while processing jobs on Runners (follow-up for #3157 (moved))
  • alerting for high disk utilization (follow-up for #3157 (moved))
  • improve alerting for CI runners outage (#2631 (closed))
  • improve monitoring of the Runner's cache servers (#2116 (closed), #1608 (closed))
  • introduce metrics and alerting for DO token limits (#1452 (closed))
  • alert for Runner crashes (#1074 (closed))
Edited Nov 07, 2017 by Tomasz Maczukin