cluster-maxunavailable fails to hold off the drain of an MD machine
(This issue was seen on release-1.4, but is quite likely also present on main.)
Summary: we have one case observed in CI where the cluster-maxunavailable controller allows the drain of an MD node although a rolling update of a CP node is still in progress.
The following was observed in https://gitlab.com/sylva-projects/sylva-core/-/jobs/11209191935:
- at 22:34 cp-2 was created; its node was up 7 minutes later (22:41)
- at 22:45 cp-1 was created; its node was up 8 minutes later (22:53)
- when the debug-on-exit is taken, the cp-0 Machine is stuck in Provisioning, and there is no cp-0 node
- at this point md-0 hasn't been touched: its drain is held off by cluster-maxunavailable, which still logs:

  Machine not selected for being drained: There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable)

- at 22:57:22, the cluster-maxunavailable controller still has a correct view of the control plane (1 CP machine unavailable):

  2025-09-01T22:57:52Z DEBUG There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable)

- at 22:57:51, the Machine for cp-0 is marked for deletion; the CAPI controller logs:

  I0901 22:57:51.431471 1 machine_controller.go:617] "Deleting node" ... Machine="sylva-system/mgmt-2015860940-rke2-capm3-virt-control-plane-tgzsb" ... Node="mgmt-2015860940-rke2-capm3-virt-management-cp-0"

- at 22:57:52 the deletion of the cp-0 Node is finishing:

  2025-09-01T22:57:52Z INFO Removing finalizer from node {"controller": "providerIDBlacklist", "Node": {"name":"mgmt-2015860940-rke2-capm3-virt-management-cp-0"}

- at 22:58:04 the cluster-maxunavailable controller considers that all CP machines are available; this is wrong ❗

  2025-09-01T22:58:04Z DEBUG unavailable machines ... "control-plane": 0, "machine-deployments": 0}

  this coincides with this log of the RKE2 controller, whose status update is wrong as well:

  I0901 22:58:04.719148 ... "Successfully updated RKE2ControlPlane status" ....

- accordingly, cluster-maxunavailable removes the pre-drain hook on the Machine for md0 (this mechanism is sketched just after this list):

  2025-09-01T22:58:04Z INFO Will let machine be drained and deleted ... "machine": "mgmt-2015860940-rke2-capm3-virt-md0-zbk5n-jsb6z"}

- at 22:58:05:
  - the RKE2 control plane controller logs:

    I0901 22:58:05.031795 1 workload_cluster_etcd.go:153] "Removed member: mgmt-2015860940-rke2-capm3-virt-management-cp-0-2657246b" ...
    I0901 22:58:05.069939 1 rke2controlplane_controller.go:681] "Scaling up control plane" ... Desired=3 Existing=2

  - a Machine for cp-0 is created
What seems to happen is that, at 22:58:04, the cluster-maxunavailable controller has a wrong view of the status of the RKE2ControlPlane and wrongly concludes that all CP machines are available.
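Below is a minimal sketch of the kind of check that could produce the "control-plane": 0 result at 22:58:04, assuming the count is derived from the RKE2ControlPlane object (the exact fields, here spec.replicas and status.readyReplicas, are assumptions for illustration; the actual sylva-core logic may differ). If the RKE2ControlPlane status transiently reports all replicas ready while the cp-0 Machine has already been deleted and its replacement does not yet exist (Desired=3 Existing=2 in the RKE2 controller log), such a check returns 0 and the drain gate opens too early.

```go
// Hypothetical derivation of the unavailable CP count from the RKE2ControlPlane
// status; field paths are assumptions, not the actual sylva-core implementation.
package maxunavailable

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// unavailableControlPlaneMachines returns how many control-plane replicas are
// not ready according to the RKE2ControlPlane object itself.
func unavailableControlPlaneMachines(rcp *unstructured.Unstructured) (int64, error) {
	desired, _, err := unstructured.NestedInt64(rcp.Object, "spec", "replicas")
	if err != nil {
		return 0, err
	}
	ready, _, err := unstructured.NestedInt64(rcp.Object, "status", "readyReplicas")
	if err != nil {
		return 0, err
	}
	if ready >= desired {
		// The state wrongly reached at 22:58:04: the status claims all replicas
		// are ready although only 2 of the 3 desired CP Machines exist.
		return 0, nil
	}
	return desired - ready, nil
}
```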