Monitor Kubernetes node CPU wait / noisy neighbours
In our Kubernetes clusters, I have observed more frequent situations where nodes become saturated and exhibit high CPU wait times (`node_schedstat_waiting_seconds_total` / `node_pressure_cpu_waiting_seconds_total`), which can degrade performance across all pods on the node. The root cause is often a noisy neighbour problem, where one or more pods with either:

- no CPU limits, or
- resource consumption significantly different from what they requested

consume excessive CPU cycles, causing contention for CPU at the node level. Since Kubernetes schedules based on requests (guaranteeing only a minimum CPU allocation) but does not enforce a usage ceiling unless limits are also set, a single pod can exhaust CPU availability and impact the entire node. There are good reasons we have not set limits on workloads in the past, particularly prior experiences with pod throttling caused by badly chosen limits, and concerns around very latency-sensitive workloads.

## Why This Matters

- Node saturation impacts all pods: even well-behaved pods with modest resource needs suffer degraded performance due to CPU contention.
- Critical services affected: important workloads can be slowed or destabilized unpredictably.
- Easily misdiagnosed: CPU wait issues do not necessarily correlate with pod-level resource metrics, making node-level monitoring essential. This can lead engineers to conclude that pods or workloads need more CPU, because container CPU usage appears to rise, when it is actually a flow-on effect of the pod being unable to get scheduled onto a CPU. This is particularly acute for latency-sensitive workloads.
- Prevention is difficult without visibility: without monitoring CPU wait and identifying noisy neighbours, we are reactive instead of proactive about cluster stability. Rather than avoiding the problem, we can only react to it when it occurs.

## What should we do?
We do currently have alerts for pod throttling, but we ideally don't want to be paging the EoC every time a host experiences high CPU wait for a sustained period. While we should come up with ideas for what alerting could look like, it would also be useful to track occurrences of this in Tamland, which over time could further point to workloads needing proper revision of their placement and/or CPU requests/limits.
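As a starting point for discussion, a rough PromQL sketch of what detection could look like. This assumes the node exporter's pressure collector and standard cAdvisor / kube-state-metrics metrics are available; the threshold, window, and especially the label names used for joining (`node` in particular, which often depends on relabelling config) are placeholders to verify and tune for our setup:

```promql
# Sustained CPU pressure on a node: fraction of time tasks were stalled
# waiting for CPU (PSI metric from the node exporter pressure collector).
# 0.5 here is an illustrative threshold, not a recommendation.
rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.5

# Candidate noisy neighbours on an affected node: per-pod CPU usage
# relative to requests. A ratio well above 1 suggests a pod consuming
# far more than it requested. Label names are assumptions.
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{node="<node>"}[5m])
)
/
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu", node="<node>"}
)
```

The first expression could back a recording rule for Tamland-style trend tracking rather than a paging alert; the second is more of an investigation query to run once a node shows sustained pressure.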