CAPI drains and deletes the first control-plane machine at the end of the bootstrap phase
I've noticed this behaviour in my dev environment. At the end of the bootstrap phase, once the last machine of the control plane has been created (HA mode), CAPI drains and deletes the first control-plane machine, then recreates it.
```
ubuntu@bootstrap-bms:~/bootstrap$ kubectl get machines -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sylva-system management-cluster-control-plane-jt6pd management-cluster management-cluster-cp-45fad8958c-fc85s openstack:///df8d7970-6780-4d3c-b8e4-79f2a5887ecd Running 5m13s v1.28.12-rc4+rke2r1
sylva-system management-cluster-control-plane-npbr9 management-cluster Provisioning 7s v1.28.12-rc4+rke2r1
sylva-system management-cluster-control-plane-w2cvp management-cluster management-cluster-cp-45fad8958c-v4s89 openstack:///3f8e932d-5475-48ea-ad82-a63e1f148a35 Running 13m v1.28.12-rc4+rke2r1
```

A few minutes later, the first machine (`w2cvp`) has been deleted and replaced by a new one (`c8kb6`):

```
ubuntu@bootstrap-bms:~/bootstrap$ kubectl get machines -A
NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
sylva-system management-cluster-control-plane-c8kb6 management-cluster management-cluster-cp-45fad8958c-tjnll openstack:///5b0407ff-18c3-4fe7-96e6-b77ba876c406 Running 10m v1.28.12-rc4+rke2r1
sylva-system management-cluster-control-plane-jt6pd management-cluster management-cluster-cp-45fad8958c-fc85s openstack:///df8d7970-6780-4d3c-b8e4-79f2a5887ecd Running 19m v1.28.12-rc4+rke2r1
sylva-system management-cluster-control-plane-npbr9 management-cluster management-cluster-cp-45fad8958c-g7wb8 openstack:///47d613f7-aeb9-4e59-aa9c-910b44929481 Running 14m v1.28.12-rc4+rke2r1
```
The CAPI controller logs show the drain of that machine:

```
I0725 14:15:06.945751 1 machine_controller.go:362] "Draining node" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/management-cluster-control-plane-w2cvp" namespace="sylva-system" name="management-cluster-control-plane-w2cvp" reconcileID="edd28b73-96cc-436e-aa9b-d68605518c4c" RKE2ControlPlane="sylva-system/management-cluster-control-plane" Cluster="sylva-system/management-cluster" Node="management-cluster-cp-45fad8958c-v4s89"
E0725 14:15:07.046853 1 machine_controller.go:652] "WARNING: ignoring DaemonSet-managed Pods: calico-system/calico-node-crscz, metallb-system/metallb-speaker-q25xb\n" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="sylva-system/management-cluster-control-plane-w2cvp" namespace="sylva-system" name="management-cluster-control-plane-w2cvp" reconcileID="edd28b73-96cc-436e-aa9b-d68605518c4c" RKE2ControlPlane="sylva-system/management-cluster-control-plane" Cluster="sylva-system/management-cluster" Node="management-cluster-cp-45fad8958c-v4s89"
```
We seem to have the same behaviour in CI as well: I found similar "Draining node" logs in the bootstrap CAPI controller.
After investigating the issue, the problem appears to be caused by `canReach` (https://gitlab.com/sylva-projects/sylva-core/-/blob/main/charts/sylva-units/values.yaml?ref_type=heads#L6713), which, in the CAPO/RKE2 case, takes the value of the IP allocated via heat-operator.
The problem is that the IP allocated by Heat is not yet known to the sylva-units chart at the beginning, so `canReach` falls back to 55.55.55.55, the default value of `cluster_virtual_ip` (https://gitlab.com/sylva-projects/sylva-core/-/blob/main/charts/sylva-units/values.yaml?ref_type=heads#L6283).
The s-c-c chart is therefore first installed with this default value, and the CAPO VMs start to build.
When sylva-units is later reconciled (periodically or otherwise), the value of `canReach` changes to the IP actually allocated by heat-operator. This triggers an upgrade of the s-c-c chart, and therefore a rolling upgrade of the CAPO VMs, which often have not yet finished installing.
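To make the mechanism easier to follow, here is a simplified sketch of the values involved. This is a hypothetical structure for illustration only, not the literal content of `charts/sylva-units/values.yaml` (key names and templating are assumptions):

```yaml
# Placeholder default: used on the first render, before heat-operator
# has allocated the real VIP.
cluster_virtual_ip: 55.55.55.55

units:
  capi-cluster:          # hypothetical unit name for the s-c-c chart
    helm_values:
      # In the CAPO/RKE2 case this is meant to resolve to the IP allocated
      # by heat-operator, but on the first render that IP does not exist
      # yet, so the 55.55.55.55 default leaks into the initial install.
      canReach: '{{ .Values.cluster_virtual_ip }}'
```

Because the first install of the s-c-c chart bakes in the placeholder, any later sylva-units reconcile that resolves the real Heat-allocated IP produces different chart values, and Helm sees that as an upgrade, hence the rolling replacement of the control-plane machines.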