Reduce downtime on Ops Gitaly node during upgrades
We successfully moved ops.gitlab.net off its VM in us-east1 and into a Kubernetes cluster in us-central1. However, we want to minimise the downtime incurred by the cluster's Gitaly node when it gets upgraded. There are a few things we can do (courtesy of @f_santos https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/6675#note_1339841649):
- Prevent pod eviction as part of node rotation by using the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node)
- Prevent pod eviction as part of node resource saturation: add a PriorityClass with higher priority and attach it to the pod
- Make the PV/PVC regional (pd-balanced-regional) so the pod can start in any zone (avoids waiting for a new node to spin up)
  - This will require the data on the existing PV to be moved onto the new PV and so will incur downtime (a CR will be needed!)
- Tune readiness/liveness probes and confirm there's a startupProbe to decrease pod startup time
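
To make the options above concrete, here is a rough sketch of how they might look as raw Kubernetes manifests. Only the safe-to-evict annotation and the pd-balanced-regional name come from this issue; everything else is an assumption for illustration: the object names, priority value, image, port 8075, mount path, disk size, and all probe timings. The TCP probes are placeholders for whatever health check the Gitaly container actually uses. If Gitaly is deployed through a Helm chart, the same settings would need to be expressed through that chart's values, which are not shown here.

```yaml
# PriorityClass so the Gitaly pod is ranked above other workloads
# when the node comes under resource pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gitaly-high-priority        # hypothetical name
value: 1000000                      # illustrative; must exceed other workloads' priority
globalDefault: false
description: "Keeps the Gitaly pod from being evicted first under node pressure."
---
# Regional balanced persistent disk via the GKE PD CSI driver.
# Note: a regional PD is replicated across two zones of the region.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-balanced-regional
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gitaly                      # hypothetical name
spec:
  serviceName: gitaly
  replicas: 1
  selector:
    matchLabels:
      app: gitaly
  template:
    metadata:
      labels:
        app: gitaly
      annotations:
        # Stops the cluster autoscaler from evicting this pod on scale-down.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      priorityClassName: gitaly-high-priority
      containers:
        - name: gitaly
          image: example.invalid/gitaly:placeholder   # placeholder image
          ports:
            - name: grpc
              containerPort: 8075                     # assumed Gitaly port
          # startupProbe gates the other probes until the process is up,
          # so liveness can stay strict without killing a slow start.
          startupProbe:
            tcpSocket:
              port: grpc
            periodSeconds: 2
            failureThreshold: 60                      # up to ~2 minutes to start
          readinessProbe:
            tcpSocket:
              port: grpc
            periodSeconds: 5
          livenessProbe:
            tcpSocket:
              port: grpc
            periodSeconds: 10
            failureThreshold: 3
          volumeMounts:
            - name: repo-data
              mountPath: /home/git/repositories       # assumed data path
  volumeClaimTemplates:
    - metadata:
        name: repo-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: pd-balanced-regional
        resources:
          requests:
            storage: 100Gi                            # illustrative size
```

A short probe period on the startupProbe lets the pod be marked ready soon after Gitaly begins listening, while the failure threshold still leaves room for a slow start. The actual values would need to be tuned against observed startup times.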