feat(alerts): monitor for node_schedstat_waiting_seconds_total saturation
What does this do?
This adds saturation monitoring for CPU scheduling wait time, based on the node_schedstat_waiting_seconds_total
metric.
More details about this metric can be found here: https://www.robustperception.io/cpu-scheduling-metrics-from-the-node-exporter
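For reviewers unfamiliar with the counter, here is a rough sketch in Python of how a saturation signal can be derived from it. This is not the alert definition added in this MR: the `Sample` type and `waiting_ratio` function are hypothetical names used only for illustration, and it assumes the counter is per-CPU and monotonically increasing between resets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of the counter for a single CPU (hypothetical helper type)."""
    timestamp: float        # scrape time, in seconds
    waiting_seconds: float  # value of node_schedstat_waiting_seconds_total


def waiting_ratio(earlier: Sample, later: Sample) -> float:
    """Seconds spent waiting per second of wall-clock time between two scrapes."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.waiting_seconds - earlier.waiting_seconds
    if delta < 0:
        # Assumed counter reset (e.g. a node reboot): count only what has
        # accumulated since the reset.
        delta = later.waiting_seconds
    return delta / elapsed


# Example: 15 seconds of accumulated waiting over a 60 second window.
ratio = waiting_ratio(Sample(0.0, 10.0), Sample(60.0, 25.0))
```

As far as I understand the metric, wait time is summed across all tasks queued on a CPU, so this ratio is not bounded by 1 and any saturation threshold on it would need to be chosen with that in mind.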
Why do this?
Several reasons. Some services are clearly under-provisioned, specifically the frontend fleet. This metric will let us detect that.
Longer-term, once we've collected more data, this will become a metric that we can forecast with Tamland, giving us long-term predictions and capacity planning for this resource.
Additionally, this is an alternative measurement to the context switches approach that @Finotto proposed in !2766 (closed). I feel that this is a better metric to use: context switches can be voluntary or involuntary, so they don't always indicate utilization, and they require a fairly artificial estimated ceiling value.