Failed Node deletion due to RKE2 finalizer leads to inconsistent Node data
This issue was observed on https://gitlab.com/sylva-projects/sylva-core/-/jobs/7249965679 (rke2 capm3 scheduled pipeline, update-workload-cluster job), the final symptom being that the Node for a MachineDeployment Machine (wc-1358068067-rke2-capm3-virt-md0-2v5pw-hfgp5) does not come up during the rolling update.
Investigation leads to the following observation: IP 192.168.100.63 is allocated to the Metal3Machine for the MD, but the Calico node on the MD node sees this IP as already used by the cp-2 node.
Initial investigation
- via the node_logs for this Machine (wc-1358068067-rke2-capm3-virt-md0-2v5pw-hfgp5), we see the calico-node container logs, with this error:
2024-07-02T23:43:06.61706998Z stdout F 2024-07-02 23:43:06.616 [INFO][9] startup/reachaddr.go 47: Auto-detected address by connecting to remote Destination="192.168.100.2" IP=192.168.100.63
2024-07-02T23:43:06.617346575Z stdout F 2024-07-02 23:43:06.617 [INFO][9] startup/autodetection_methods.go 143: Using autodetected IPv4 address 192.168.100.63/24, detected by connecting to 192.168.100.2
2024-07-02T23:43:06.617411488Z stdout F 2024-07-02 23:43:06.617 [INFO][9] startup/startup.go 579: Node IPv4 changed, will check for conflicts
2024-07-02T23:43:06.619983113Z stdout F 2024-07-02 23:43:06.619 [WARNING][9] startup/startup.go 1016:
Calico node 'wc-1358068067-rke2-capm3-virt-workload-cp-2' is already using the IPv4 address 192.168.100.63.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-02T23:43:06.620033529Z stdout F 2024-07-02 23:43:06.619 [INFO][9] startup/startup.go 409: Clearing out-of-date IPv4 address from this node IP="192.168.100.63/24"
2024-07-02T23:43:06.628029195Z stdout F 2024-07-02 23:43:06.627 [WARNING][9] startup/utils.go 48: Terminating
2024-07-02T23:43:06.630589177Z stderr F Calico node failed to start
- 192.168.100.63 seems to be allocated to the MD node:
$ grep -rns "done allocating addresses.*192.168.100.63"
capm3-system/capm3-controller-manager-d9d99c9c6-m7mw8/logs.txt:3367:I0702 23:11:33.562163 1 metal3data_manager.go:394] controllers/Metal3Data/Metal3Data-controller "msg"="done allocating addresses" "addresses"={"wc-1358068067-rke2-capm3-virt-primary-pool":
{"Address":"192.168.100.63","Prefix":24,"Gateway":"192.168.100.1"},"wc-1358068067-rke2-capm3-virt-provisioning-pool":
^^^^^^^^^^^^^^^^^^^^^^^^^^
{"Address":"192.168.10.63","Prefix":24,"Gateway":"192.168.10.1"}} "cluster"="wc-1358068067-rke2-capm3-virt" "metal3-data"={"Namespace":"rke2-capm3-virt","Name":
"wc-1358068067-rke2-capm3-virt-md-metadata-md0-6a01fca5f8-0"} "requeue"=false
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- but Nodes.yaml tells us the following about IP addresses:
$ grep -E '192.168.100.6|^ name:|^ kind:|internal-ip: 1' Nodes.yaml
...
kind: Node
alpha.kubernetes.io/provided-node-ip: 192.168.100.61
etcd.rke2.cattle.io/node-address: 192.168.100.61
projectcalico.org/IPv4Address: 192.168.100.61/24
rke2.io/internal-ip: 192.168.100.61
name: wc-1358068067-rke2-capm3-virt-workload-cp-0
- address: 192.168.100.61
...
kind: Node
alpha.kubernetes.io/provided-node-ip: 192.168.100.60
etcd.rke2.cattle.io/node-address: 192.168.100.60
projectcalico.org/IPv4Address: 192.168.100.60/24
rke2.io/internal-ip: 192.168.100.60
name: wc-1358068067-rke2-capm3-virt-workload-cp-1
- address: 192.168.100.60
...
kind: Node
alpha.kubernetes.io/provided-node-ip: 192.168.100.62
etcd.rke2.cattle.io/node-address: 192.168.100.63 <<<<<<<<<<<
projectcalico.org/IPv4Address: 192.168.100.63/24 <<<<<<<<<<
rke2.io/internal-ip: 192.168.100.62
name: wc-1358068067-rke2-capm3-virt-workload-cp-2 <<<<<<<<<<<<<
- address: 192.168.100.62
...
kind: Node
alpha.kubernetes.io/provided-node-ip: 192.168.100.63
rke2.io/internal-ip: 192.168.100.63
name: wc-1358068067-rke2-capm3-virt-workload-md-0
- address: 192.168.100.63
So apparently, 192.168.100.63 is also present on the cp-2 Node for etcd.rke2.cattle.io/node-address and projectcalico.org/IPv4Address... although its IP is 192.168.100.62 (a way to cross-check this live on the cluster is sketched after this list).
- Looking at the node logs of the cp-2 node (EDIT: this isn't relevant, this is the server for the new cp-2, not for the old one, see comments below):
  - most things relate to 192.168.100.62 (its legitimate IP)
  - there's only one occurrence of 192.168.100.63, in the metallb speaker logs (cannot assign requested address); the analysis below leads me to think that this metallb error has the same cause as the calico-node error message on our MD node
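For reference, the same data can in principle be cross-checked live rather than by grepping the collected artifacts. This is only a sketch, not something run in this pipeline: the first command assumes kubeconfig access to the workload cluster, the second assumes access to the management cluster and that Metal3 IPAM objects (ipam.metal3.io group) back the address allocation.
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDED-IP:.metadata.annotations.alpha\.kubernetes\.io/provided-node-ip,CALICO-IP:.metadata.annotations.projectcalico\.org/IPv4Address,ETCD-IP:.metadata.annotations.etcd\.rke2\.cattle\.io/node-address'
$ kubectl -n rke2-capm3-virt get ipaddresses.ipam.metal3.io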
Epilogue
In fact, we get the Calico node 'wc-1358068067-rke2-capm3-virt-workload-cp-2' is already using the IPv4 address 192.168.100.63 error because the wc-1358068067-rke2-capm3-virt-workload-cp-2 Node carries the projectcalico.org/IPv4Address: 192.168.100.63/24 annotation.
However, what is interesting is that this wc-1358068067-rke2-capm3-virt-workload-cp-2 Node is in a funky state: it is actually pending deletion.
kind: Node
creationTimestamp: "2024-07-02T21:53:35Z"
^^^^^ this shows that the Node we're looking at is the Node for the previous infra machine for cp-2!
deletionTimestamp: "2024-07-02T23:06:47Z"
^^^^^ this is the date at which deletion of _the old Node_ was requested
(matches the creationTimestamp of the new Machine)
finalizers:
- wrangler.cattle.io/cisnetworkpolicy-node
^^^^^^^^^^ this is a finalizer set by RKE2 (more on this one below...)
name: wc-1358068067-rke2-capm3-virt-workload-cp-2
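(Side note, a sketch not run in this pipeline: Nodes stuck in this half-deleted state can be listed by looking for a deletionTimestamp together with leftover finalizers, assuming jq is available.)
$ kubectl get nodes -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | [.metadata.name, .metadata.deletionTimestamp, ((.metadata.finalizers // []) | join(","))] | @tsv'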
So the issue we have is that:
- reminder: we are in a capm3 context, where the node names are derived from the BMH names, which implies that when a Node is replaced during a CAPI node rolling update, **the new node will have the same name**
- here we see a case where Nodes aren't deleted properly:
  - this seems to be due to the wrangler.cattle.io/cisnetworkpolicy-node finalizer (the only finalizer left on the wc-1358068067-rke2-capm3-virt-workload-cp-2 Node); this is tracked in RKE2 upstream issue https://github.com/rancher/rke2/issues/1895 (a possible manual workaround is sketched after this list)
  - this is an issue that we had seen in the past, that was solved for us by the move to RKE2 1.28.9, and that now pops up again since we had to downgrade (!2509 (merged))
- these old Nodes remain, and when new Machines/servers come up to replace them, instead of registering brand new Nodes they find existing ones with annotations and/or statuses that are wrong/misleading
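A possible manual workaround, only sketched here, would be to drop the leftover finalizer by hand so that the stale Node can actually be removed before the replacement Machine registers; this is a mitigation, not a fix for the root cause tracked in the RKE2 issue above, and force-removing finalizers should be done with care:
$ kubectl patch node wc-1358068067-rke2-capm3-virt-workload-cp-2 --type=merge -p '{"metadata":{"finalizers":null}}'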