Review kubernetes container resource saturation monitoring
Why?
As we move more services into kubernetes, now is a good time to validate that the monitoring and alerting for these services is sane. After all, it would be embarrassing if the observability team didn't have good observability into its own observability stack.
Service-level saturation
Most of our monitoring is oriented around the four golden signals. Some dimensions of service saturation are generic across services: e.g. node-level CPU and memory saturation. Service-level saturation metrics for VM-deployed services are derived from these generic, system-oriented saturation metrics, often collected from the node exporter.
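For contrast with the kubernetes case below, here is a minimal sketch of the kind of node-exporter-derived saturation recording rules we mean. The rule names are illustrative placeholders, not actual metrics-catalog definitions.

```yaml
# Illustrative Prometheus recording rules; the rule names are
# placeholders, not our real metrics-catalog definitions.
groups:
  - name: vm-saturation-example
    rules:
      # Node memory saturation: fraction of memory in use,
      # derived from node exporter metrics.
      - record: instance:node_memory_saturation:ratio
        expr: >
          1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      # Node CPU saturation: average non-idle CPU time over 5m.
      - record: instance:node_cpu_saturation:ratio
        expr: >
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```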
Kubernetes pod-deployed services have two important per-container resources: memory and CPU cgroup constraints.
Kubernetes container resources
A very brief summary of kubernetes container resources and how we use them. Statements like "we tend to" are far from universal, but are true at least some of the time.
CPU
CPU requests are used to do two things:
- map directly to cgroup CPU shares
- inform the scheduler how much (fractional) core usage this container might use
CPU limits map directly to cgroup CPU CFS throttling, which places an absolute cap on the CPU time a process can consume per CFS period. This can needlessly stall applications, so we avoid setting CPU limits.
CPU shares only define the ratio of CPU time allocated to processes when the node is under CPU pressure. If nothing else is trying to use the CPU, the constrained process can use 100% of the CPU.
Kubernetes will not schedule a pod onto a node if the sum of the already-scheduled pods' CPU requests, plus the new pod's request, would exceed the node's allocatable vCPUs: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-requests-are-scheduled
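To make this concrete, here is a minimal sketch of a container spec following this policy; the names and values are placeholders, not an actual service manifest.

```yaml
# Illustrative pod spec; names and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-service
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest
      resources:
        requests:
          # Maps to cgroup CPU shares and tells the scheduler to
          # reserve half a vCPU of allocatable capacity on the node.
          cpu: 500m
        # Deliberately no CPU limit: the container is never
        # CFS-throttled and can use idle CPU beyond its request.
```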
Memory
We tend to set memory request == limit, so that capacity planning for memory is easy to reason about.
Memory limits map directly to memory cgroup limits. If the container exceeds its memory limit, the kernel OOM killer will kill a process in that cgroup, almost always terminating the container.
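Continuing the sketch above, the memory policy looks like this inside the container's resources stanza (values are placeholders):

```yaml
# Illustrative fragment of the container spec above;
# values are placeholders.
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    # limit == request keeps memory capacity planning simple:
    # the memory the scheduler reserves is exactly the memory
    # the container may use. Exceeding 2Gi invokes the cgroup
    # OOM killer, which almost always terminates the container.
    memory: 2Gi
```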
Monitoring
We should validate that:
- There is appropriate pod/container-level monitoring of all kubernetes resources (e.g. from kube-state-metrics).
- Services have sensible saturation points defined for kubernetes concerns (preferably in the metrics-catalog)
- We have appropriate alerting in place for services that are saturated on kubernetes CPU requests / memory limits
- We have monitoring/alerting at the kubernetes host level for resource saturation
- We have monitoring/alerting in place to inform us when the cluster is almost (say, 90%) committed on memory or CPU requests, which would prevent scheduling of more pods (a sketch of example rules follows this list).
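As a starting point, a sketch of what a few of these alerts could look like as Prometheus rules. The rule names, thresholds, and durations are illustrative assumptions, and exact metric names vary across kube-state-metrics and cAdvisor versions; the real definitions should live in the metrics-catalog.

```yaml
# Illustrative Prometheus alerting rules; names, thresholds, and
# durations are assumptions, not production definitions.
groups:
  - name: kubernetes-saturation-example
    rules:
      # Container memory saturation: working set approaching the
      # cgroup memory limit, i.e. approaching an OOM kill.
      # The != 0 filter excludes containers without a memory limit.
      - alert: ContainerMemoryNearLimit
        expr: >
          container_memory_working_set_bytes
            / (container_spec_memory_limit_bytes != 0) > 0.9
        for: 15m
      # CFS throttling: should stay near zero given our policy of
      # not setting CPU limits; sustained throttling means a limit
      # has slipped in somewhere.
      - alert: ContainerCPUThrottled
        expr: >
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 15m
      # Cluster commitment: CPU requests approaching allocatable
      # capacity, at which point new pods will fail to schedule.
      # An analogous rule with resource="memory" covers memory.
      - alert: ClusterCPURequestsNearAllocatable
        expr: >
          sum(kube_pod_container_resource_requests{resource="cpu"})
            / sum(kube_node_status_allocatable{resource="cpu"}) > 0.9
        for: 30m
```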