Consider adjusting our Pod Disruption Budget to increase the rate of maintenance operations
We had a recent incident where we determined that a newer Calico version might have prevented the issue. We changed the regional cluster's kubernetes_version from 1.21 to 1.23 a month ago (https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4180), but the pods involved in the incident were still on 1.22.
One possibility is that we simply don't get through all the node pools quickly enough within the maintenance windows we have configured. @igorwwwwwwwwwwwwwwwwwwww found aborted upgrade operations that support that hypothesis:
```
$ gcloud container operations list
...
operation-1666856784470-b80464a4  UPGRADE_NODES  us-east1  sidekiq-urgent-other-5  Operation was aborted: operation-1666856784470-b80464a4.  DONE  2022-10-27T07:46:24.470771688Z  2022-10-27T08:30:23.381373837Z
...
```
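To confirm how far behind each pool is, we can compare node pool versions against the control plane, and list the aborted node upgrade operations specifically. A sketch of the relevant commands; the cluster name below is a placeholder, not our actual cluster name:

```shell
# Show each node pool's current version to see which pools still lag
# behind the control plane (cluster name is a placeholder).
gcloud container node-pools list \
  --cluster=CLUSTER_NAME \
  --region=us-east1 \
  --format="table(name,version)"

# List only node-upgrade operations whose status message mentions an abort.
gcloud container operations list \
  --filter="operationType=UPGRADE_NODES AND statusMessage~aborted"
```

Cross-referencing the aborted operations' node pool names (e.g. sidekiq-urgent-other-5 above) with the version table should show whether aborts correlate with the pools stuck on 1.22.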
This could be due to our Pod Disruption Budget. We recently adjusted it precisely because we noted that node rotation took a long time. We should consider whether it needs further adjustment to allow for more timely upgrades.
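As a sketch of what a further adjustment could look like: loosening `maxUnavailable` on the PDB lets the drain evict more pods in parallel, so node upgrades complete faster within a maintenance window. The PDB name, namespace, and percentage below are illustrative, not our actual configuration:

```shell
# Hypothetical: relax the PDB so that up to 20% of the matching pods may
# be unavailable at once during node drains. Name/namespace/value are
# illustrative placeholders.
kubectl patch pdb sidekiq-urgent-other \
  --namespace gitlab \
  --type merge \
  --patch '{"spec":{"maxUnavailable":"20%"}}'
```

The trade-off is availability during drains: a higher `maxUnavailable` speeds up rotation but reduces the number of replicas guaranteed to stay up, so the value should be chosen against the workload's replica count and tolerance for concurrent evictions.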