Backup CronJob "safe-to-evict=false" autoscaler annotation preventing scheduling in GKE Autopilot
Summary
Our backup CronJob is currently hardcoded with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation, which results in a FailedCreate event in GKE Autopilot clusters with the following error:
Warning FailedCreate 2m3s (x7 over 12m) job-controller Error creating: admission webhook "policycontrollerv2.common-webhooks.networking.gke.io" denied the request: GKE Policy Controller rejected the request because it violates one or more policies: {"[denied by autogke-node-affinity-selector-limitation]":["Auto GKE disallows use of cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on workloads"]}
Steps to reproduce
- Create a GKE Autopilot cluster
- Install the GitLab Helm Chart with the backup cron enabled (scheduled every 5 minutes to trigger the failure quickly):
  gitlab:
    toolbox:
      backups:
        cron:
          enabled: true
          schedule: "*/5 * * * *"
- Observe the FailedCreate events for the gitlab-toolbox-backup job.
Discussion
Having GKE evict the backup pod seems problematic, and it makes sense that the Job pod is marked safe-to-evict=false.
However, depending on the cluster configuration, eviction may be a low-probability event, and the risk might be acceptable here.
Given that we have also blogged about using the Helm Chart with Autopilot, and that this pod is the only one in the chart with an explicit cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation, would it make sense to add something like a gitlab.toolbox.backups.cron.allowEviction conditional that leaves out the annotation so that the Job can be scheduled in Autopilot?
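To make the proposal concrete, here is a minimal sketch of what such a conditional could look like in the backup CronJob template. The allowEviction key is hypothetical (it does not exist in the chart today), and the surrounding template structure and the `.Values.backups.cron` path inside the toolbox subchart are assumptions; the real template may be laid out differently.

```yaml
# Hypothetical excerpt of the toolbox backup CronJob template (layout assumed).
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # allowEviction is a proposed, not yet existing, value.
            # When it is unset or false, the annotation is rendered as today.
            {{- if not .Values.backups.cron.allowEviction }}
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
            {{- end }}
```

With that in place, an Autopilot user could opt out of the annotation through their values, while the default behaviour stays unchanged for everyone else:

```yaml
gitlab:
  toolbox:
    backups:
      cron:
        enabled: true
        allowEviction: true  # hypothetical key: omit the safe-to-evict annotation
```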