Alert/Forecast on Kubernetes pod and node resource saturation
Summary
During the Kubernetes nodepool efficiency improvements in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15901+, we triggered a series of saturation events caused by resource contention among pods (mostly CPU).
The pod resource requests we had in place were unrealistically low, and we were relying on inefficient node allocation (spare, unrequested capacity on underpacked nodes) to absorb the excess usage.
To improve our response and avoid similar surprises in the future, we should alert and/or forecast on pod/node resource saturation for CPU/RAM.
Metrics of interest:
- `node_schedstat_waiting_seconds_total`: node CPU scheduling saturation
- `container_cpu_cfs_throttled_seconds_total`: pod/container CPU throttling
- `container_cpu_usage_seconds_total`: pod/container CPU usage
- `container_memory_usage_bytes`: pod/container RAM usage
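As an illustration, these metrics can be combined with per-pod request data into Prometheus alerting rules. The sketch below is a minimal example, not the final rules: the alert names, thresholds, and `5m`/`for:` windows are placeholders, and it assumes kube-state-metrics is scraped into the same Prometheus so that `kube_pod_container_resource_requests` carries `namespace`/`pod` labels matching the cAdvisor series.

```yaml
# Minimal sketch of workload-level saturation rules built on the
# metrics above. Alert names, thresholds, and windows are placeholders.
groups:
  - name: pod-resource-saturation
    rules:
      - alert: ContainerCpuThrottlingHigh
        # Seconds per second the container spent throttled by the CFS
        # controller (can exceed 1 for multi-threaded cgroups);
        # sustained throttling means the CPU request/limit is below
        # what the workload actually needs.
        expr: |
          avg by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
          ) > 0.25
        for: 15m

      - alert: PodCpuRequestSaturation
        # CPU usage as a fraction of the requested cores; assumes
        # kube-state-metrics data is present so the namespace/pod
        # labels line up for the division.
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          ) > 0.90
        for: 30m

      - alert: PodMemoryRequestSaturation
        # Same ratio for RAM, using the usage metric listed above.
        expr: |
          sum by (namespace, pod) (
            container_memory_usage_bytes{container!=""}
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="memory"}
          ) > 0.90
        for: 30m
```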
Examples of metrics being used to tune requests:
- Experiment increase web worker count (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2396 - merged)
- Improve websockets resource allocation (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2379 - merged)
Related Incident(s)
- 2022-11-07: Kubernetes node-level hotspotting a... (production#8010 - closed)
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16783+
Originating issue(s): https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16783+
Desired Outcome/Acceptance Criteria
- Alert/forecast on average Kubernetes pod resource saturation per workload
- Alert/forecast on average Kubernetes node resource saturation per cluster/nodepool and node (higher thresholds)
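A node-level counterpart might look like the sketch below. The thresholds are again placeholders (with the per-node alert set higher than the pool-wide average, per the criteria above), the `cluster` label is assumed to exist as an external label for the per-cluster/nodepool aggregation, and `predict_linear` is shown as one possible way to cover the forecasting half of the criteria.

```yaml
# Minimal node-level sketch; thresholds and the `cluster` label are
# assumptions to be adapted to the actual Prometheus setup.
groups:
  - name: node-resource-saturation
    rules:
      - alert: NodeCpuSchedulingSaturation
        # rate(node_schedstat_waiting_seconds_total) is the time per
        # second runnable tasks spent waiting for a CPU; averaging over
        # a node's CPUs gives a per-node saturation signal.
        expr: |
          avg by (instance) (
            rate(node_schedstat_waiting_seconds_total[5m])
          ) > 0.10
        for: 30m

      - alert: ClusterCpuSchedulingSaturation
        # Same signal averaged across the whole cluster/nodepool, with
        # a lower threshold than the per-node alert above.
        expr: |
          avg by (cluster) (
            rate(node_schedstat_waiting_seconds_total[5m])
          ) > 0.05
        for: 1h

  - name: saturation-forecast
    rules:
      - alert: PodMemorySaturationForecast
        # Forecasting example: linear extrapolation of container memory
        # usage four hours out, compared against the requested bytes.
        expr: |
          sum by (namespace, pod) (
            predict_linear(container_memory_usage_bytes{container!=""}[1h], 4 * 3600)
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="memory"}
          ) > 1.0
        for: 30m
```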
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')