deadlock between CAPI pre-flight checks and cluster-maxunavailable

We've observed a Sylva upgrade CI run that goes into a deadlock between CAPI pre-flight checks and cluster-maxunavailable:

  • the state is:
    • 3 CP Machines/Nodes
      • one running the older k8s version
      • one running the older k8s version, pending drain
      • one already running the newer k8s version
    • MachineDeployment is scaled down to zero
  • cluster-maxunavailable prevents the drain of a CP Machine because the MachineDeployment is missing one Machine
  • the MachineDeployment controller refuses to create a new Machine because of the CAPI pre-flight check that blocks creating MD Machines until the control plane is fully upgraded to the newer Kubernetes version (see the sketch after this list)

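To make the cycle explicit, here is a minimal Go sketch of the two blocking conditions side by side; all names and the assumed MD desired replica count of 1 are illustrative, not the actual CAPI or misc-controllers-suite identifiers:

```go
// Illustrative sketch of the two conditions that deadlock each other.
package main

import "fmt"

type clusterState struct {
	cpMachinesOnNewVersion int
	cpMachinesTotal        int
	mdReadyReplicas        int
	mdDesiredReplicas      int
}

// canDrainControlPlaneMachine mirrors the cluster-maxunavailable decision:
// draining a CP Machine is only allowed while the MachineDeployment is at
// its desired replica count (assumption based on the observed behaviour).
func canDrainControlPlaneMachine(s clusterState) bool {
	return s.mdReadyReplicas >= s.mdDesiredReplicas
}

// canCreateMDMachine mirrors the pre-flight check: new MD Machines are only
// created once every control-plane Machine runs the newer k8s version.
func canCreateMDMachine(s clusterState) bool {
	return s.cpMachinesOnNewVersion == s.cpMachinesTotal
}

func main() {
	// State observed in the CI run: 2 CP Machines still on the older
	// version, 1 already upgraded, MD scaled down to zero (a desired
	// count of 1 is assumed here).
	s := clusterState{
		cpMachinesOnNewVersion: 1,
		cpMachinesTotal:        3,
		mdReadyReplicas:        0,
		mdDesiredReplicas:      1,
	}

	fmt.Println("CP drain allowed:   ", canDrainControlPlaneMachine(s)) // false
	fmt.Println("MD creation allowed:", canCreateMDMachine(s))          // false
	// Neither condition can become true without the other: deadlock.
}
```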
The scenario needs to be analyzed in more detail, but the problem arises because the cluster-maxunavailable controller should not have allowed the drain/removal of an MD Node in the first place (because of https://gitlab.com/sylva-projects/sylva-elements/misc-controllers-suite/-/blob/c37ed37801b81ed198d733f6be7e5dbecdd64b59/internal/controllers/clustermaxunavailable/cluster_maxunavailable.go#L322 and https://gitlab.com/sylva-projects/sylva-elements/misc-controllers-suite/-/blob/c37ed37801b81ed198d733f6be7e5dbecdd64b59/internal/controllers/clustermaxunavailable/cluster_maxunavailable.go#L271)
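For illustration, the budget check at the referenced lines is presumably of this shape; the names, fields and arithmetic below are assumptions, only meant to show why removing the single MD Machine should have been denied:

```go
// Illustrative sketch of a maxunavailable budget check for MD Node removal.
package main

import (
	"errors"
	"fmt"
)

type deploymentStatus struct {
	desiredReplicas   int
	availableReplicas int
	maxUnavailable    int
}

// allowNodeRemoval returns an error when removing one more Node would push
// the MachineDeployment over its allowed unavailability budget.
func allowNodeRemoval(d deploymentStatus) error {
	unavailableAfter := d.desiredReplicas - (d.availableReplicas - 1)
	if unavailableAfter > d.maxUnavailable {
		return errors.New("removal would exceed the maxunavailable budget")
	}
	return nil
}

func main() {
	// With a single MD replica and a budget of 0 unavailable Machines,
	// the removal that led to the observed state should have been denied.
	fmt.Println(allowNodeRemoval(deploymentStatus{
		desiredReplicas:   1,
		availableReplicas: 1,
		maxUnavailable:    0,
	}))
}
```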

NOTE: the log dumps don't carry much information; in particular, we lack older k8s events, and we lack misc-controller logs from the moment when the MD Machine was deleted

/cc @cristian.manda @mederic.deverdilhac @feleouet
