Rolling upgrade failing with machines stuck on SettingProviderIDOnNodeFailed
Summary
This issue has been observed on various CI runs, and became much more frequent while working on !2553 (merged).
It is inherent to the CAPI machine lifecycle: the machine controller deletes the Node before deleting the Machine (and the infra-machine), so there is a short window during which the kubelet still running on the machine being deleted can re-create the Node.
In that case, the Node object is left over once the Machine is deleted, and it blocks the registration of the Node for the replacement Machine: the providerID of the stale Node cannot be updated because this field is immutable.
We foresee two solutions to prevent this issue:
- Use a Kyverno policy to track providerIDs and block the re-creation of a Node that reuses an old ID.
- Add pre-commands that delete any existing Node object before starting rke2 or kubeadm.
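The second option could be sketched as a bootstrap-config fragment. This is only an illustration, not a tested fix: the template name is hypothetical, `preKubeadmCommands` is the standard KubeadmConfig field for running commands before kubeadm (the RKE2 bootstrap provider has an equivalent `preRKE2Commands` field), and how the node obtains credentials to reach the API server at that point is an open, deployment-specific assumption:

```yaml
# Sketch only: delete a stale Node object for this hostname before the
# kubelet (re)registers, so the new Machine's providerID can be set.
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: example-workers   # hypothetical name
spec:
  template:
    spec:
      preKubeadmCommands:
        # Assumes a kubeconfig with permission to delete Nodes is available
        # on the host at this path (deployment-specific assumption).
        - kubectl --kubeconfig /etc/kubernetes/cleanup.conf delete node "$(hostname)" --ignore-not-found
```

`--ignore-not-found` makes the command a no-op on the common path where no stale Node exists, so the pre-command does not fail normal bootstraps.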
Edited by Mathieu Rohon