cluster-maxunavailable fails to hold off the drain of an MD machine
(This issue was seen on release-1.4, but is quite likely also present on main.)
Summary: we have one case observed in CI where the cluster-maxunavailable controller allows the drain of an MD node although a rolling update of a CP node is still in progress.
The following was observed in https://gitlab.com/sylva-projects/sylva-core/-/jobs/11209191935:
- at 22:34 cp-2 was created; its node was up 7 minutes later (22:41)
- at 22:45 cp-1 was created; its node was up 8 minutes later (22:53)
- when the debug-on-exit is taken, the cp-0 Machine is stuck in Provisioning, and there is no cp-0 node
- at this point md-0 hasn't been touched: its drain is held off by cluster-maxunavailable, which still logs:

  Machine not selected for being drained: There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable)

- at 22:57:22, the cluster-maxunavailable controller still has a correct view of the control plane (1 CP machine unavailable):

  2025-09-01T22:57:52Z DEBUG There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable)

- at 22:57:51, the Machine for cp-0 is marked for deletion; the CAPI controller logs:

  I0901 22:57:51.431471 1 machine_controller.go:617] "Deleting node" ... Machine="sylva-system/mgmt-2015860940-rke2-capm3-virt-control-plane-tgzsb" ... Node="mgmt-2015860940-rke2-capm3-virt-management-cp-0"

- at 22:57:52 the deletion of the cp-0 Node is finishing:

  2025-09-01T22:57:52Z INFO Removing finalizer from node {"controller": "providerIDBlacklist", "Node": {"name":"mgmt-2015860940-rke2-capm3-virt-management-cp-0"}

- at 22:58:04 the cluster-maxunavailable controller considers that all CP machines are available; this is wrong ❗

  2025-09-01T22:58:04Z DEBUG unavailable machines ... "control-plane": 0, "machine-deployments": 0}

  this coincides with this log of the RKE2 controller, whose status update is wrong as well:

  I0901 22:58:04.719148 ... "Successfully updated RKE2ControlPlane status" ....

- accordingly, cluster-maxunavailable removes the pre-drain hook on the Machine for md0 (this mechanism is sketched just after this list):

  2025-09-01T22:58:04Z INFO Will let machine be drained and deleted ... "machine": "mgmt-2015860940-rke2-capm3-virt-md0-zbk5n-jsb6z"}

- at 22:58:05:
  - the RKE2 control plane controller logs:

    I0901 22:58:05.031795 1 workload_cluster_etcd.go:153] "Removed member: mgmt-2015860940-rke2-capm3-virt-management-cp-0-2657246b" ...
    I0901 22:58:05.069939 1 rke2controlplane_controller.go:681] "Scaling up control plane" ... Desired=3 Existing=2

  - a Machine for cp-0 is created
What seems to happen is that, at 22:58:04, the cluster-maxunavailable controller has a wrong view of the status of the RKE2ControlPlane and wrongly concludes that all CP machines are available.
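Below is a minimal sketch of the kind of check that could produce the "control-plane": 0 result at 22:58:04, assuming the count is derived from the RKE2ControlPlane object (the exact fields, here spec.replicas and status.readyReplicas, are assumptions for illustration; the actual sylva-core logic may differ). If the RKE2ControlPlane status transiently reports all replicas ready while the cp-0 Machine has already been deleted and its replacement does not yet exist (Desired=3 Existing=2 in the RKE2 controller log), such a check returns 0 and the drain gate opens too early.

```go
// Hypothetical derivation of the unavailable CP count from the RKE2ControlPlane
// status; field paths are assumptions, not the actual sylva-core implementation.
package maxunavailable

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// unavailableControlPlaneMachines returns how many control-plane replicas are
// not ready according to the RKE2ControlPlane object itself.
func unavailableControlPlaneMachines(rcp *unstructured.Unstructured) (int64, error) {
	desired, _, err := unstructured.NestedInt64(rcp.Object, "spec", "replicas")
	if err != nil {
		return 0, err
	}
	ready, _, err := unstructured.NestedInt64(rcp.Object, "status", "readyReplicas")
	if err != nil {
		return 0, err
	}
	if ready >= desired {
		// The state wrongly reached at 22:58:04: the status claims all replicas
		// are ready although only 2 of the 3 desired CP Machines exist.
		return 0, nil
	}
	return desired - ready, nil
}
```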