Disambiguate freshly unready pods from ones that were stopped long ago
Running `kubectl get pods -n gitlab` in production reveals many evicted pods. We have hundreds of KubePodNotReady alerts pertaining to these pods, all of which we ignore.
If we care about pod evictions, and we should, there are a few things we could do, some of which complement and some of which negate each other:
- Improve the alert so that it doesn't continue to fire long after a given eviction
- Alert only on the rate of change in unready pods
- Revise our Kubernetes resources, setting the memory request equal to the memory limit, and using a sensible CPU request with no CPU limit (see https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/master/charts/simpleapp/values.yaml#L27-46 for justification)
- Periodically clean up evicted pods from the kube API
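
As a rough illustration of the first and last options, here is a minimal sketch of how stale evictions could be told apart from fresh ones. It assumes the input is the parsed JSON from `kubectl get pods -n gitlab -o json`, and that evicted pods show up with `status.phase` set to `Failed` and `status.reason` set to `Evicted`. The function names and the one-hour threshold are hypothetical, and `status.startTime` (falling back to `metadata.creationTimestamp`) is only an approximation of when the eviction happened.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2021-01-05T10:00:00Z (UTC)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_evicted(pod):
    """True if the pod object looks like an evicted pod."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"


def stale_evicted_pods(pod_list, now, max_age=timedelta(hours=1)):
    """Return the names of evicted pods older than max_age.

    Assumption: the pod's start time stands in for the eviction time,
    so this can overestimate how long ago the eviction happened.
    max_age is a hypothetical threshold, not an agreed value.
    """
    stale = []
    for pod in pod_list.get("items", []):
        if not is_evicted(pod):
            continue
        started = (
            pod.get("status", {}).get("startTime")
            or pod["metadata"]["creationTimestamp"]
        )
        if now - parse_k8s_time(started) > max_age:
            stale.append(pod["metadata"]["name"])
    return stale
```

The returned names could then be passed to `kubectl delete pod -n gitlab <name>` from a periodic job, or the same age check could be used to decide which unready pods are recent enough to be worth alerting on.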
Tangentially related to https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10842
Edited by Craig Furman