Rolling upgrade failing with machines stuck on SettingProviderIDOnNodeFailed
Summary
This issue has been observed on various CI runs, and became much more frequent while working on !2553 (merged).
It is inherent to the CAPI machine lifecycle: the machine controller deletes the Node before deleting the Machine (and the infra-machine), so there is a short window during which the kubelet still running on the machine being deleted can re-create the Node.
In that case, the Node object is left over once the Machine is deleted, and it blocks the registration of the Node for the replacement Machine: the providerID of the stale Node cannot be updated because this field is immutable.
We foresee two solutions to prevent this issue:
- Use a Kyverno policy to track providerIDs and block the re-creation of a Node that reuses an old ID.
- Add pre-commands that delete any existing Node object before starting rke2 or kubeadm.
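The second option could be sketched as a bootstrap-config fragment. This is only an illustration, not a tested fix: the template name is hypothetical, `preKubeadmCommands` is the standard KubeadmConfig field for running commands before kubeadm (the RKE2 bootstrap provider has an equivalent `preRKE2Commands` field), and how the node obtains credentials to reach the API server at that point is an open, deployment-specific assumption:

```yaml
# Sketch only: delete a stale Node object for this hostname before the
# kubelet (re)registers, so the new Machine's providerID can be set.
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: example-workers   # hypothetical name
spec:
  template:
    spec:
      preKubeadmCommands:
        # Assumes a kubeconfig with permission to delete Nodes is available
        # on the host at this path (deployment-specific assumption).
        - kubectl --kubeconfig /etc/kubernetes/cleanup.conf delete node "$(hostname)" --ignore-not-found
```

`--ignore-not-found` makes the command a no-op on the common path where no stale Node exists, so the pre-command does not fail normal bootstraps.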
Edited by Mathieu Rohon