[meta] Introduce new alerts for CI
This is a meta issue that tracks all the alerting needed from the CI perspective:
- generic node_load1 alert: runbooks!402 (follow-up for #3157 (moved))
- alerting for unusual proportions of different states and stages while processing jobs on Runners (follow-up for #3157 (moved))
- alerting for high disk utilization (follow-up for #3157 (moved))
- improve alerting for CI runners outage (#2631 (closed))
- improve Runner's cache servers monitoring (#2116 (closed), #1608 (closed))
- introduce metrics and alerting for DO token limits (#1452 (closed))
- alert for Runner's crashes (#1074 (closed))
Edited by Tomasz Maczukin