investigate application behaviour during involuntary disruptions (e.g. a pod getting OOM killed)
This issue should be considered together with: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10625
This kind of review could become part of the production readiness review performed before moving workloads to k8s.
see also: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
- consider doing "Chaos Monkey" style testing:
  - kill a node abruptly (e.g. trigger a kernel panic)
  - kill a pod (e.g. delete a pod)
  - use chaos engineering tools/frameworks, for example:
    - https://github.com/litmuschaos/litmus
    - https://www.gremlin.com/docs/
    - chaoskube
    - kube-monkey
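
As a starting point for the "kill a pod" case, here is a minimal sketch of what a manual experiment could look like before adopting one of the frameworks above. It is not how any of the listed tools work internally. The function names (`pick_victim`, `delete_command`, `kill_random_pod`) and the namespace and label selector in the usage note are hypothetical. It assumes `kubectl` is installed and pointed at a non-production cluster, and it defaults to a dry run.

```python
import json
import random
import subprocess


def pick_victim(pod_list, rng=random):
    """Return the name of one randomly chosen Running pod, or None if there is none."""
    running = [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Running"
    ]
    return rng.choice(running) if running else None


def delete_command(namespace, pod_name, abrupt=False):
    """Build the kubectl command that deletes the given pod.

    abrupt=True skips the graceful termination period, which is an
    approximation of an involuntary disruption, not an exact match.
    """
    cmd = ["kubectl", "delete", "pod", pod_name, "--namespace", namespace]
    if abrupt:
        cmd += ["--grace-period=0", "--force"]
    return cmd


def kill_random_pod(namespace, selector, dry_run=True):
    """List pods matching the selector, pick one, and delete it unless dry_run is set."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace,
         "--selector", selector, "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    victim = pick_victim(json.loads(out))
    if victim is None:
        return None
    cmd = delete_command(namespace, victim, abrupt=True)
    if dry_run:
        # Only print the command so the experiment can be reviewed first.
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return victim
```

For example, `kill_random_pod("my-namespace", "app=my-app")` prints the delete command for one matching pod without running it; passing `dry_run=False` actually deletes the pod. Watching error rates and restart counts while this runs shows how the application copes with losing a pod abruptly.
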
Edited by Michal Wasilewski