Andrew Newdigate requested to merge node_schedstat_waiting_saturation into master Apr 12, 2021

What does this do?

This adds saturation monitoring for CPU scheduler waiting time, based on the node_schedstat_waiting_seconds_total metric.

More details about this metric can be found here: https://www.robustperception.io/cpu-scheduling-metrics-from-the-node-exporter
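
As a rough illustration of how a saturation value could be derived from this counter, here is a minimal Python sketch. It is not the expression used in this merge request. The `Sample` type, the function names, and the `ceiling` of 1.0 waiting seconds per second are all hypothetical, chosen only for illustration. Because the counter sums waiting time across every task queued on a CPU, the raw rate can in principle exceed 1, so the sketch clamps the result.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of node_schedstat_waiting_seconds_total for a single CPU."""
    timestamp: float        # Unix time of the scrape, in seconds
    waiting_seconds: float  # cumulative counter value at that time


def waiting_rate(older: Sample, newer: Sample) -> float:
    """Seconds spent waiting for the CPU per second of wall-clock time."""
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = newer.waiting_seconds - older.waiting_seconds
    if delta < 0:
        # The counter went backwards, e.g. after a reboot. Treat the newer
        # value as the whole increase since the reset.
        delta = newer.waiting_seconds
    return delta / elapsed


def saturation(older: Sample, newer: Sample, ceiling: float = 1.0) -> float:
    """Waiting rate as a fraction of an assumed ceiling, clamped to [0, 1].

    The default ceiling is an assumption made for this sketch only.
    """
    return min(waiting_rate(older, newer) / ceiling, 1.0)
```

For example, two scrapes taken 60 seconds apart with counter values 100 and 115 give a waiting rate of 15 / 60 = 0.25, which this sketch reports as 25% saturation.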

https://dashboards.gitlab.net/dashboard/snapshot/qTKI4OFSJt9ZxtCysD85eldLk6v2FO92?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=frontend&var-stage=main&from=now-24h&to=now

Why do this?

There are several reasons. Some services are clearly under-provisioned, the frontend fleet in particular, and this metric gives us a way to detect that.

Longer term, once we've collected more data, this will become a metric that we can forecast with Tamland, giving us long-term predictions and capacity planning for this resource.

Additionally, this is an alternative measurement to the context switches approach that @Finotto proposed in !2766 (closed). I feel that this is a better metric to use: context switches can be voluntary or involuntary, so they don't always indicate utilization, and they require a fairly artificial estimated ceiling value.

Edited Apr 12, 2021 by Andrew Newdigate

feat(alerts): monitor for node_schedstat_waiting_seconds_total saturation