investigate application behaviour during voluntary disruptions (e.g. draining a node for an update)
This issue is a continuation of the discussion that started here: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1837#note_86073
The purpose of this issue is to investigate (and potentially document) the behaviour of different GitLab components during voluntary disruptions (shutdown). This kind of review could become part of the production readiness review performed prior to moving workloads to k8s.
see also: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
- verify that podDisruptionBudgets exist for current workloads: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1837#note_86102
- test shutdown behavior of all components currently deployed:
  - when idle
  - when under heavy load
- ensure we're using the right terminationGracePeriodSeconds (podDisruptionBudgets should already be configured)
- update documentation related to admin tasks: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/k8s-gitlab.md#cycling-node-pools
- add this sort of verification to the readiness review
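As a rough sketch of the first and third items above, a PodDisruptionBudget paired with an explicit `terminationGracePeriodSeconds` might look like the following. All names, labels, and numbers here are illustrative assumptions, not our actual configuration, and the `apiVersion` values may differ depending on the cluster version:

```yaml
# Hypothetical example: keep at least 2 webservice pods available during
# voluntary disruptions such as a `kubectl drain` of a node.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webservice-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: webservice
---
# Hypothetical deployment fragment: give containers time to finish in-flight
# work between SIGTERM and SIGKILL. 30s is the Kubernetes default; components
# with long-running requests or jobs may need more.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webservice
spec:
  selector:
    matchLabels:
      app: webservice
  template:
    metadata:
      labels:
        app: webservice
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: webservice
          image: example/webservice:latest
```

During a drain, the eviction API refuses to evict pods that would violate the budget, and `terminationGracePeriodSeconds` bounds how long Kubernetes waits after SIGTERM before sending SIGKILL, which is exactly the behaviour the shutdown tests above should exercise under idle and heavy-load conditions.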