Failed node deletion prevents reliable recreation of Node
This issue is similar to #1421 (closed), which led us to initially conflate the two issues (see #1421 (comment 1983485800)).
With @feleouet, we finally concluded that the issue here is that:
- the CAPI controller fails to delete the Node (code: https://github.com/kubernetes-sigs/cluster-api/blob/44fe37a148a99ed0842982ddf8df7ca42bea98c8/internal/controllers/machine/machine_controller.go#L480); from events.log:

  ```
  2024-07-03T21:40:17Z 2024-07-03T21:40:17Z Machine sylva-system mgmt-1359687528-rke2-capm3-virt-control-plane-vjkwz 1
  FailedDeleteNode error deleting Machine's node:
  error deleting node mgmt-1359687528-rke2-capm3-virt-management-cp-0:
  client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
  ```
- no requeue is attempted, because Machine.spec.nodeDeletionTimeout is left at its default value of 10s and the timeout window starts at the Machine deletion time, while the Node deletion is only attempted much later than 10s after that, because the full node drain runs first (see the sketch after this list)
- the recreation of the node (with the same name, because this is on capm3) then fails because the stale Node object still exists
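For reference, the behavior described above boils down to a check like the following (a simplified Go sketch of the logic around the linked machine_controller.go line, not the exact upstream code; `nodeDeletionTimeoutPassed` is an illustrative name):

```go
package sketch

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// nodeDeletionTimeoutPassed is an illustrative reconstruction of the
// machine controller's timeout check: once it returns true, the
// controller stops retrying Node deletion and lets the Machine go away,
// leaving the stale Node object behind.
func nodeDeletionTimeoutPassed(machine *clusterv1.Machine) bool {
	if machine.DeletionTimestamp == nil {
		return false // Machine is not being deleted
	}
	// A nil or zero nodeDeletionTimeout means "retry deletion indefinitely".
	if machine.Spec.NodeDeletionTimeout == nil ||
		machine.Spec.NodeDeletionTimeout.Duration == 0 {
		return false
	}
	// The window is anchored on the Machine's deletionTimestamp, not on the
	// first Node deletion attempt, so a node drain longer than 10s consumes
	// the whole default window before deletion is even tried.
	deadline := machine.DeletionTimestamp.Add(machine.Spec.NodeDeletionTimeout.Duration)
	return time.Now().After(deadline)
}
```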
In fact (at least with capm3, since in Sylva Node names are reused) it does not seem safe at all to leave Machine.spec.nodeDeletionTimeout at 10s.
We propose to set Machine.spec.nodeDeletionTimeout to zero (i.e. unlimited deletion attempts/requeues), to ensure that if Node deletion fails, the deployment is stuck and goes no further.
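A minimal sketch of that change, assuming the Machines are stamped out by a MachineDeployment (a zero metav1.Duration maps to `nodeDeletionTimeout: 0s` in the manifest; the helper name is hypothetical, and the actual change in Sylva would go through the generated control-plane/MachineDeployment manifests):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// forceUnlimitedNodeDeletion is an illustrative helper (not Sylva code):
// a zero nodeDeletionTimeout makes the machine controller retry Node
// deletion indefinitely instead of giving up after the 10s default.
func forceUnlimitedNodeDeletion(md *clusterv1.MachineDeployment) {
	md.Spec.Template.Spec.NodeDeletionTimeout = &metav1.Duration{Duration: 0}
}
```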