Alert/Forecast on Kubernetes pod and node resource saturation
Summary
During the Kubernetes nodepool efficiency improvements in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15901+, we triggered a series of saturation events caused by resource contention among pods (mostly CPU).
The pod resource requests we had in place were unrealistically low, and we were relying on inefficient node allocation (spare, unrequested capacity on underpacked nodes) to absorb the excess usage.
To improve our response and avoid similar surprises in the future, we should alert and/or forecast on pod/node resource saturation for CPU/RAM.
Metrics of interest:
- `node_schedstat_waiting_seconds_total`: node CPU scheduling saturation
- `container_cpu_cfs_throttled_seconds_total`: pod/container CPU throttling
- `container_cpu_usage_seconds_total`: pod/container CPU usage
- `container_memory_usage_bytes`: pod/container RAM usage
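As an illustration, these metrics can be combined with per-pod request data into Prometheus alerting rules. The sketch below is a minimal example, not the final rules: the alert names, thresholds, and `5m`/`for:` windows are placeholders, and it assumes kube-state-metrics is scraped into the same Prometheus so that `kube_pod_container_resource_requests` carries `namespace`/`pod` labels matching the cAdvisor series.

```yaml
# Minimal sketch of workload-level saturation rules built on the
# metrics above. Alert names, thresholds, and windows are placeholders.
groups:
  - name: pod-resource-saturation
    rules:
      - alert: ContainerCpuThrottlingHigh
        # Seconds per second the container spent throttled by the CFS
        # controller (can exceed 1 for multi-threaded cgroups);
        # sustained throttling means the CPU request/limit is below
        # what the workload actually needs.
        expr: |
          avg by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
          ) > 0.25
        for: 15m

      - alert: PodCpuRequestSaturation
        # CPU usage as a fraction of the requested cores; assumes
        # kube-state-metrics data is present so the namespace/pod
        # labels line up for the division.
        expr: |
          sum by (namespace, pod) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="cpu"}
          ) > 0.90
        for: 30m

      - alert: PodMemoryRequestSaturation
        # Same ratio for RAM, using the usage metric listed above.
        expr: |
          sum by (namespace, pod) (
            container_memory_usage_bytes{container!=""}
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="memory"}
          ) > 0.90
        for: 30m
```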
Examples of metrics being used to tune requests:
- Experiment increase web worker count (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2396 - merged)
- Improve websockets resource allocation (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2379 - merged)
Related Incident(s)
- 2022-11-07: Kubernetes node-level hotspotting a... (production#8010 - closed)
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16783+
Originating issue(s): https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16783+
Desired Outcome/Acceptance Criteria
- Alert/forecast on average Kubernetes pod resource saturation per workload
- Alert/forecast on average Kubernetes node resource saturation per cluster/nodepool and node (higher thresholds)
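A node-level counterpart might look like the sketch below. The thresholds are again placeholders (with the per-node alert set higher than the pool-wide average, per the criteria above), the `cluster` label is assumed to exist as an external label for the per-cluster/nodepool aggregation, and `predict_linear` is shown as one possible way to cover the forecasting half of the criteria.

```yaml
# Minimal node-level sketch; thresholds and the `cluster` label are
# assumptions to be adapted to the actual Prometheus setup.
groups:
  - name: node-resource-saturation
    rules:
      - alert: NodeCpuSchedulingSaturation
        # rate(node_schedstat_waiting_seconds_total) is the time per
        # second runnable tasks spent waiting for a CPU; averaging over
        # a node's CPUs gives a per-node saturation signal.
        expr: |
          avg by (instance) (
            rate(node_schedstat_waiting_seconds_total[5m])
          ) > 0.10
        for: 30m

      - alert: ClusterCpuSchedulingSaturation
        # Same signal averaged across the whole cluster/nodepool, with
        # a lower threshold than the per-node alert above.
        expr: |
          avg by (cluster) (
            rate(node_schedstat_waiting_seconds_total[5m])
          ) > 0.05
        for: 1h

  - name: saturation-forecast
    rules:
      - alert: PodMemorySaturationForecast
        # Forecasting example: linear extrapolation of container memory
        # usage four hours out, compared against the requested bytes.
        expr: |
          sum by (namespace, pod) (
            predict_linear(container_memory_usage_bytes{container!=""}[1h], 4 * 3600)
          )
          /
          sum by (namespace, pod) (
            kube_pod_container_resource_requests{resource="memory"}
          ) > 1.0
        for: 30m
```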
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')