Consider a more impactful Pod Disruption Budget value
Investigate sensible values that allow us to accept a slight performance hit while we roll Pods for maintenance. The goal is to shorten the time maintenance tasks take while keeping disruption minimal. Are there other options we can take advantage of to speed up maintenance work? One example: with #1023 (closed), we may not need to do this at all.
During maintenance, it's common for us to drain a node. This can take a long time because of the PDB configuration: with a tolerance of only 1 Pod, each eviction has to wait for the replacement to be Ready. The time it takes for our Pods to complete init and start passing health checks is not necessarily short. The wait includes the termination grace period, the time to shut down the work being processed on the old Pod, and the time it takes for the replacement Pod to become Ready. In comparison, we can swap a full 110-Pod deployment within a few minutes, while the strict PDB constraint can take well over an hour for just 1 workload.
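To make the trade-off concrete, here is a rough back-of-the-envelope sketch of how the PDB tolerance affects drain time. It assumes a simplified serial model: Pods are evicted in batches the size of the allowed disruptions, and each batch waits for the old Pods to shut down and the replacements to become Ready. The function name and all numbers (20 Pods, 60 s shutdown, 180 s to Ready) are hypothetical and not measurements from our workloads.

```python
import math


def estimate_drain_seconds(pods_to_evict, max_unavailable,
                           shutdown_seconds, ready_seconds):
    """Rough estimate of drain time under a PDB (simplified serial model).

    Assumes evictions proceed in batches of `max_unavailable` Pods and that
    each batch costs shutdown time plus replacement readiness time.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    batches = math.ceil(pods_to_evict / max_unavailable)
    return batches * (shutdown_seconds + ready_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: 20 Pods to evict, 60 s shutdown, 180 s to Ready.
    for budget in (1, 5, 10):
        seconds = estimate_drain_seconds(20, budget, 60, 180)
        print(f"maxUnavailable={budget}: ~{seconds / 60:.0f} min")
```

Under these assumed numbers, a tolerance of 1 gives roughly 80 minutes, while a tolerance of 5 gives roughly 16 minutes. Real drains will differ, since shutdown and replacement startup can overlap and scheduling adds its own delay.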
Alternatively, see #1023 (closed).