Alert when CPU utilization spikes on the primary database host
Proposal
So far we've been monitoring CPU utilization on primary database hosts manually, often as part of a reactive effort triggered by a Tamland capacity alert.
To improve this, we should set up a monitoring alert as described in https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/alerts_manual.md#how-to-add-new-alerts. We can start by sending alerts to the g_database_frameworks Slack channel.
One way to do this is a hard-coded threshold, for example "CPU saturation > 70%". To make this more flexible, we can use the so-called 3-sigma rule: "approximately all our 'normal' data should be within 3 standard deviations of the average value of your data" (https://blog.davidvassallo.me/2021/10/01/grafana-prometheus-detecting-anomalies-in-time-series/). This lets us alert on spikes relative to current usage.
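To make the 3-sigma idea concrete, here is a minimal Python sketch of the same calculation outside Prometheus. It is only an illustration: the sample values and the `z_score` helper are hypothetical, and it uses the population standard deviation, which is what `stddev_over_time` computes.

```python
from statistics import mean, pstdev

def z_score(recent, baseline):
    """Distance of the recent average from the baseline average,
    measured in baseline standard deviations."""
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline has no spread to compare against.
        return 0.0
    return (mean(recent) - mean(baseline)) / sigma

# Hypothetical samples: a long baseline window and two short recent windows.
baseline = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
calm = [11, 10]
spike = [25, 27]

print(z_score(calm, baseline) > 3)   # False: within 3 standard deviations
print(z_score(spike, baseline) > 3)  # True: well outside 3 standard deviations
```

The alert condition is then simply "z-score above 3", which adapts to whatever the recent normal level is instead of relying on a fixed percentage.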
Example PromQL query:
# "approximately all our "normal" data should be within 3 standard deviations of the average value of your data"
#
# https://blog.davidvassallo.me/2021/10/01/grafana-prometheus-detecting-anomalies-in-time-series/
(
(avg_over_time(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[$__rate_interval]) and on (fqdn) pg_replication_is_replica == 0)
-
(avg_over_time(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[1d]) and on (fqdn) pg_replication_is_replica == 0)
) /
(stddev_over_time(node_pressure_cpu_waiting_seconds_total{env="gprd", type="patroni"}[1d]) and on (fqdn) pg_replication_is_replica == 0)
This query uses node_pressure_cpu_waiting_seconds_total, which is defined as “Total time in seconds that processes have waited for CPU time”. It basically matches node_cpu_seconds_total, with the advantage that it is a single value per node instead of a value per CPU (so 100+ per node). I failed to write a query for the latter that combines the per-CPU values.
node_pressure_cpu_waiting_seconds_total compared to node_cpu_seconds_total:
3-sigma rule for node_pressure_cpu_waiting_seconds_total compared to node_cpu_seconds_total:
In this case we want to be alerted on the two spikes on the 15th and 16th, but not the rest. If we want to catch more spikes, we can lower the threshold from 3 to 2.75 or 2.5.
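The effect of lowering the threshold can be sketched in Python on a synthetic series. The data and the `flagged` helper are hypothetical and only illustrate the trade-off. For simplicity the mean and standard deviation are taken over the whole series rather than over a rolling 1d window as in the query.

```python
from statistics import mean, pstdev

def flagged(samples, threshold):
    """Return the samples that sit more than `threshold` standard
    deviations above the mean of the whole series."""
    mu, sigma = mean(samples), pstdev(samples)
    if sigma == 0:
        return []
    return [v for v in samples if (v - mu) / sigma > threshold]

# Synthetic series: a noisy baseline followed by four spikes of growing size.
samples = [90, 110] * 48 + [137, 141, 150, 160]

print(flagged(samples, 3.0))   # [150, 160]
print(flagged(samples, 2.75))  # [141, 150, 160]
print(flagged(samples, 2.5))   # [137, 141, 150, 160]
```

Each step down in the threshold catches one more of the smaller spikes, at the cost of more notifications for borderline events.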
Implementation
- Update existing alert to use the query from #502505 (comment 2225172399) (gitlab-com/runbooks!8295 (merged))
- Alert for both main and ci primary DB hosts (gitlab-com/runbooks!8295 (merged))
- Update team to be notified to be #g_database_frameworks (gitlab-com/runbooks!8295 (merged))
- Create runbook and update links (gitlab-com/runbooks!8431 (merged))
- Convert to template alert

