Observed occurrences where cluster-maxunavailable is defeated
Context
The Sylva misc-controller-suite/cluster-maxunavailable controller is meant to prevent more than one node from being rebuilt at any point in time. Today it relies on the counts of available/unavailable replicas published in the status of RKE2ControlPlane and MachineSet resources.
Problem
There appear to be time windows during which these counters do not yet reflect the fact that a Machine resource is being deleted. As a result, the controller prematurely concludes that no unavailable machine remains.
CI testing
We lack CI to check that this controller provides the expected guarantees. Two things are missing:
- an automated check
- a CI setup (non-emulated capm3?) in which the problematic time window would be significant
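As a rough illustration of what the automated check could look like, the sketch below scans a series of observations taken during a rolling update and reports every point in time where more than one machine was unavailable. Everything here is an assumption made for illustration: the `find_violations` helper, the snapshot format (a timestamp plus the set of names of Machines seen as being deleted or not ready), and the way snapshots would be collected are hypothetical and not part of the existing controller or CI.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical snapshot format: (timestamp, names of Machines observed as
# unavailable at that time, i.e. being deleted or not ready).
Snapshot = Tuple[float, Set[str]]


def find_violations(
    snapshots: Iterable[Snapshot], max_unavailable: int = 1
) -> List[Snapshot]:
    """Return every snapshot where more machines were unavailable than allowed."""
    return [
        (timestamp, names)
        for timestamp, names in snapshots
        if len(names) > max_unavailable
    ]
```

A CI job could poll the Machine resources of the cluster while an upgrade is in progress, feed the resulting snapshots to such a function, and fail if the returned list is not empty. The polling interval would need to be short compared to the time window described in the Problem section for the check to be meaningful.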
Proposed solution
We need to refactor the misc-controller-suite/cluster-maxunavailable controller so that it no longer depends on the status of RKE2ControlPlane and MachineSet resources.
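One possible direction, sketched below under stated assumptions, is to derive the unavailable count from the Machine resources themselves, treating a Machine as unavailable as soon as its deletion timestamp is set. The `Machine` dataclass is a hypothetical, simplified stand-in for the real resource, and `is_unavailable`, `count_unavailable` and `may_rebuild_another` are illustrative names, not the controller's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Machine:
    """Hypothetical, simplified view of a Machine resource."""

    name: str
    ready: bool  # assumed to be derived from the Machine's readiness information
    deletion_timestamp: Optional[str] = None  # metadata.deletionTimestamp, if set


def is_unavailable(machine: Machine) -> bool:
    # A Machine being deleted counts as unavailable immediately, even if
    # aggregated replica counters have not caught up yet.
    return machine.deletion_timestamp is not None or not machine.ready


def count_unavailable(machines: Iterable[Machine]) -> int:
    return sum(1 for machine in machines if is_unavailable(machine))


def may_rebuild_another(machines: Iterable[Machine], max_unavailable: int = 1) -> bool:
    """Allow a new rebuild only while the unavailable count is below the limit."""
    return count_unavailable(machines) < max_unavailable
```

With this approach, a Machine that is still reported as ready but already has a deletion timestamp blocks further rebuilds, which is the case the status counters appear to miss.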