Explore options for handling AWS EKS PVC autoscaling conflicts across multiple zones
A known limitation that can occur on Kubernetes providers, particularly AWS EKS, is that the Cluster Autoscaler cannot perform scale-up actions when a pod can't be scheduled because its attached PersistentVolume resides in a different Availability Zone than the available nodes. This stems from the design of EBS volumes, which are bound to the specific zone they were initially provisioned in.
Example: A Gitaly pod initially deployed with its PV in us-west-2a cannot be scheduled when nodes only exist in us-west-2b and us-west-2c, resulting in "volume node affinity conflict" errors.
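For illustration, below is a rough sketch of what a dynamically provisioned EBS-backed PV looks like in this scenario (names, sizes, and the volume ID are hypothetical, and the exact topology key varies by provisioner/driver version). The `nodeAffinity` section is what pins the volume, and therefore the pod, to us-west-2a, and it's this requirement the scheduler can't satisfy once no nodes remain in that zone:

```yaml
# Illustrative PV roughly as the EBS CSI driver provisions it (hypothetical names/IDs).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-gitaly-example
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # hypothetical EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            # Topology key depends on the provisioner; the EBS CSI driver
            # commonly uses topology.ebs.csi.aws.com/zone.
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-west-2a
```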
This happens most frequently when node pools have been changed outside of the Cluster Autoscaler, such as when AWS EKS performs a node pool upgrade.
Solutions are unfortunately limited because the conflict spans disparate system layers. AWS Karpenter can reportedly handle this, but it's a heavier, AWS-only option (and this issue can occur on other providers). Another solution is splitting out node pools per AZ, which is to be explored; a sketch of what that could look like follows below.
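As a starting point for that exploration, here's a minimal eksctl sketch of per-AZ node groups (cluster name, instance type, and sizes are assumptions, not a recommendation). With one Auto Scaling group per zone, the Cluster Autoscaler can pick the group in the same zone as the stranded PV instead of failing on the volume node affinity conflict:

```yaml
# Sketch only: per-AZ managed node groups for eksctl (names/sizes are placeholders).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gitlab-eks          # hypothetical cluster name
  region: us-west-2
managedNodeGroups:
  - name: gitaly-us-west-2a
    availabilityZones: ["us-west-2a"]   # one zone per group
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 3
  - name: gitaly-us-west-2b
    availabilityZones: ["us-west-2b"]
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 3
  - name: gitaly-us-west-2c
    availabilityZones: ["us-west-2c"]
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 3
```

The design idea is that each Auto Scaling group maps to exactly one zone, so a scale-up is guaranteed to land in the zone the PV requires; the Cluster Autoscaler would still need to be configured to discover each group (e.g. via its auto-discovery tags).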