tune calico-node-vertical-autoscaler ConfigMap to reduce pod restarts
In our larger GKE clusters we are seeing a very high rate of calico-node pod
restarts: the vertical autoscaler frequently updates their resource requests, replacing the pods with a new revision.
In some clusters this happens aggressively enough to cause network disruption.
For more sensitive workloads this manifests as things like unexpected liveness probe failures.
From an open case with Google support:
The behavior you are seeing in the GKE calico DaemonSet is expected: in large clusters that do a lot of autoscaling, calico may restart frequently enough to cause disruption, and in some situations it may be necessary to desensitize its vertical autoscaler with higher step values.
This can be accomplished by editing the calico-node-vertical-autoscaler ConfigMap. Increasing base, step, and nodesPerStep by 50%, for instance, should cause the vertical autoscaler to react less often, but more intensely, to changes in cluster size.
We can tweak the base requests and nodesPerStep in these larger clusters to reduce the rate at which the calico-node pods attempt to vertically scale.
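As a sketch, the change would look something like the following. The ConfigMap lives in kube-system and feeds the cluster-proportional vertical pod autoscaler; the field names (base, step, nodesPerStep, max) follow that autoscaler's config format, but the concrete resource values here are hypothetical examples of "increase by ~50%", not our clusters' actual settings, which should be read from the live ConfigMap first (e.g. kubectl -n kube-system get configmap calico-node-vertical-autoscaler -o yaml).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-node-vertical-autoscaler
  namespace: kube-system
data:
  node-autoscaler: |-
    {
      "calico-node": {
        "requests": {
          "cpu": {
            "base": "120m",        # was e.g. 80m; higher base delays the first scale step
            "step": "30m",         # was e.g. 20m; larger step = fewer, bigger adjustments
            "nodesPerStep": 15,    # was e.g. 10; more node growth needed per adjustment
            "max": "500m"          # unchanged ceiling
          }
        }
      }
    }
```

Each bump to nodesPerStep widens the cluster-size band in which the computed request stays constant, so the autoscaler rewrites the DaemonSet (and restarts calico-node pods) less often as nodes are added or removed.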