observed occurrences where cluster-maxunavailable is defeated

Context

The Sylva misc-controller-suite/cluster-maxunavailable controller is meant to prevent more than one node from being rebuilt at any point in time. Today it relies on the counts of available/unavailable replicas published in the status of RKE2ControlPlane and MachineSet resources.

Problem

There appear to be time windows during which these counters do not reflect the fact that a Machine resource is being deleted. As a result, the controller prematurely concludes that no unavailable machine remains.

CI testing

We lack CI to check that this controller provides the expected guarantees. Two things are missing:

  • automated check
  • CI setup (non-emulated capm3?) where the problematic time window would be significant

Proposed solution

We need to refactor the misc-controller-suite/cluster-maxunavailable controller so that it no longer depends on the status of RKE2ControlPlane and MachineSet resources.
