Failed node deletion prevents reliable recreation of Node
This issue is similar to #1421 (closed), which led us to initially conflate the two issues (see #1421 (comment 1983485800)).
With @feleouet, we finally concluded that the issue here is that:
- the CAPI controller fails to delete the Node (code: https://github.com/kubernetes-sigs/cluster-api/blob/44fe37a148a99ed0842982ddf8df7ca42bea98c8/internal/controllers/machine/machine_controller.go#L480); from events.log:

  ```
  2024-07-03T21:40:17Z 2024-07-03T21:40:17Z Machine sylva-system mgmt-1359687528-rke2-capm3-virt-control-plane-vjkwz 1
  FailedDeleteNode error deleting Machine's node:
  error deleting node mgmt-1359687528-rke2-capm3-virt-management-cp-0:
  client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
  ```
- no requeue is attempted, because Machine.spec.nodeDeletionTimeout is left at its default value of 10s and the timeout window starts at the Machine deletion time, while the Node deletion is only attempted much later than 10s after that, because the full node drain runs first (see the sketch after this list)
- the recreation of the node (with the same name, because this is on capm3) then fails because the stale Node object still exists
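For reference, the behavior described above boils down to a check like the following (a simplified Go sketch of the logic around the linked machine_controller.go line, not the exact upstream code; `nodeDeletionTimeoutPassed` is an illustrative name):

```go
package sketch

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// nodeDeletionTimeoutPassed is an illustrative reconstruction of the
// machine controller's timeout check: once it returns true, the
// controller stops retrying Node deletion and lets the Machine go away,
// leaving the stale Node object behind.
func nodeDeletionTimeoutPassed(machine *clusterv1.Machine) bool {
	if machine.DeletionTimestamp == nil {
		return false // Machine is not being deleted
	}
	// A nil or zero nodeDeletionTimeout means "retry deletion indefinitely".
	if machine.Spec.NodeDeletionTimeout == nil ||
		machine.Spec.NodeDeletionTimeout.Duration == 0 {
		return false
	}
	// The window is anchored on the Machine's deletionTimestamp, not on the
	// first Node deletion attempt, so a node drain longer than 10s consumes
	// the whole default window before deletion is even tried.
	deadline := machine.DeletionTimestamp.Add(machine.Spec.NodeDeletionTimeout.Duration)
	return time.Now().After(deadline)
}
```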
In fact (at least with capm3, since in Sylva Node names are reused) it does not seem safe at all to leave Machine.spec.nodeDeletionTimeout at 10s.
We propose to set Machine.spec.nodeDeletionTimeout to zero (i.e. unlimited deletion attempts/requeues), to ensure that if Node deletion fails, the deployment is stuck and goes no further.
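A minimal sketch of that change, assuming the Machines are stamped out by a MachineDeployment (a zero metav1.Duration maps to `nodeDeletionTimeout: 0s` in the manifest; the helper name is hypothetical, and the actual change in Sylva would go through the generated control-plane/MachineDeployment manifests):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// forceUnlimitedNodeDeletion is an illustrative helper (not Sylva code):
// a zero nodeDeletionTimeout makes the machine controller retry Node
// deletion indefinitely instead of giving up after the 10s default.
func forceUnlimitedNodeDeletion(md *clusterv1.MachineDeployment) {
	md.Spec.Template.Spec.NodeDeletionTimeout = &metav1.Duration{Duration: 0}
}
```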