cluster-maxunavailable fails to hold off the drain of an MD machine

(This issue was seen on release-1.4, but is quite likely also present on main.)

Summary: we have observed one case in CI where the cluster-maxunavailable controller allows the drain of an MD node although a rolling update of the control plane is still in progress.
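
For context: the hold-off relies on Cluster API's machine deletion hooks. cluster-maxunavailable sets a pre-drain.delete.hook.machine.cluster.x-k8s.io/... annotation on Machines, and the CAPI Machine controller does not start draining a Machine while such an annotation is present; it is the removal of that annotation at 22:58:04 in the timeline below that lets the drain of md0 proceed. A minimal sketch of that gating pattern is given below; the /cluster-maxunavailable annotation suffix and the allowDrain helper are illustrative assumptions, not the actual sylva-core code.

    // Sketch of how a pre-drain deletion hook gates draining (assumed logic,
    // not the actual cluster-maxunavailable implementation).
    package main

    import "fmt"

    // CAPI will not start draining a Machine as long as an annotation with
    // this prefix is present on it.
    const preDrainHookPrefix = "pre-drain.delete.hook.machine.cluster.x-k8s.io"

    // Annotation assumed to be owned by the hold-off controller
    // (the "/cluster-maxunavailable" suffix is hypothetical).
    const hookKey = preDrainHookPrefix + "/cluster-maxunavailable"

    // allowDrain: a Machine selected for deletion may only lose its hook when
    // no other machine in the cluster is unavailable.
    func allowDrain(unavailableCP, unavailableMD int) bool {
        return unavailableCP == 0 && unavailableMD == 0
    }

    func main() {
        md0Annotations := map[string]string{hookKey: ""}

        // 22:57:52: 1 CP machine unavailable, so the hook stays on md0.
        if !allowDrain(1, 0) {
            fmt.Println("hook kept, drain held off:", md0Annotations)
        }

        // 22:58:04: the controller (wrongly) computes 0 unavailable CP machines,
        // removes the hook, and CAPI proceeds with the drain of md0.
        if allowDrain(0, 0) {
            delete(md0Annotations, hookKey)
            fmt.Println("hook removed, drain allowed:", md0Annotations)
        }
    }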

The following was observed in https://gitlab.com/sylva-projects/sylva-core/-/jobs/11209191935:

  • at 22:34 cp-2 was created; its node was up 7 minutes later (22:41)

  • at 22:45 cp-1 was created; its node was up 8 minutes later (22:53)

  • when the debug-on-exit capture is taken, the cp-0 Machine is stuck Provisioning, and there is no cp-0 node

  • at this point md-0 hasn't been touched; its drain is held off by cluster-maxunavailable, which still logs:

    Machine not selected for being drained: There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable)
  • as late as 22:57:52, the cluster-maxunavailable controller still has an accurate view of the control plane (1 CP machine unavailable):

    2025-09-01T22:57:52Z    DEBUG   There is at least one unavailable machine, can't select any machine for drain and delete (1 CP machine unavailable, 0 MD machines unavailable) 
  • at 22:57:51, the Machine for cp-0 is marked for deletion; the CAPI controller logs:

    I0901 22:57:51.431471       1 machine_controller.go:617] 
    "Deleting node" ... Machine="sylva-system/mgmt-2015860940-rke2-capm3-virt-control-plane-tgzsb" ...  
    Node="mgmt-2015860940-rke2-capm3-virt-management-cp-0"
  • at 22:57:52 the deletion of the cp-0 Node is completing:

    2025-09-01T22:57:52Z	INFO	Removing finalizer from node	{"controller": "providerIDBlacklist",  "Node": {"name":"mgmt-2015860940-rke2-capm3-virt-management-cp-0"}
  • at 22:58:04 the cluster-maxunavailable controller considers that all CP machines are available

    • this is wrong ❗
    2025-09-01T22:58:04Z    DEBUG   unavailable machines  ... "control-plane": 0, "machine-deployments": 0}
    • this coincides with this log from the RKE2 controller:
    I0901 22:58:04.719148       ... "Successfully updated RKE2ControlPlane status" ....
  • accordingly, the cluster-maxunavailable controller removes the pre-drain hook on the Machine for md0:

    2025-09-01T22:58:04Z    INFO    Will let machine be drained and deleted ... "machine": "mgmt-2015860940-rke2-capm3-virt-md0-zbk5n-jsb6z"}
  • at 22:58:05:

    • the RKE2 control plane controller logs:
    I0901 22:58:05.031795       1 workload_cluster_etcd.go:153]
        "Removed member: mgmt-2015860940-rke2-capm3-virt-management-cp-0-2657246b" ...
    I0901 22:58:05.069939       1 rke2controlplane_controller.go:681] "Scaling up control plane"
     ... Desired=3 Existing=2
    • a new Machine for cp-0 is created

What seems to happen is that, at 22:58:04, the cluster-maxunavailable controller has an incorrect view of the RKE2ControlPlane status and wrongly concludes that all CP machines are available.
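
To make the suspected race concrete: between the deletion of the old cp-0 Machine (22:57:51) and the creation of its replacement (22:58:05), every control-plane Machine that still exists is ready ("Scaling up control plane ... Desired=3 Existing=2"). Whether the wrong view comes from the RKE2ControlPlane status itself (which was updated at exactly 22:58:04) or from the way cluster-maxunavailable reads it is not clear from these logs. The sketch below only illustrates why, if unavailability were derived from existing-vs-ready replica counters rather than from the desired replica count, it would transiently evaluate to 0 in that window; the struct and both helpers are illustrative assumptions, not the actual code.

    // Sketch of the suspected race window (assumed computation, not the actual
    // cluster-maxunavailable or RKE2ControlPlane status code).
    package main

    import "fmt"

    // Illustrative subset of RKE2ControlPlane spec/status fields.
    type controlPlane struct {
        desiredReplicas int32 // spec.replicas
        replicas        int32 // status: Machines currently existing
        readyReplicas   int32 // status: Machines currently ready
    }

    // unavailableFromCounters only compares existing Machines with ready ones.
    func unavailableFromCounters(cp controlPlane) int32 {
        return cp.replicas - cp.readyReplicas
    }

    // unavailableFromDesired compares ready Machines with the desired count.
    func unavailableFromDesired(cp controlPlane) int32 {
        return cp.desiredReplicas - cp.readyReplicas
    }

    func main() {
        // 22:58:04: cp-0's Machine is deleted, its replacement not yet created;
        // cp-1 and cp-2 are up, so the 2 existing Machines are all ready.
        window := controlPlane{desiredReplicas: 3, replicas: 2, readyReplicas: 2}

        fmt.Println("existing vs ready:", unavailableFromCounters(window), "unavailable CP machine(s)") // 0 -> hook removed
        fmt.Println("desired vs ready: ", unavailableFromDesired(window), "unavailable CP machine(s)")  // 1 -> hook kept
    }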
