Need to preserve XXXMachineTemplates and (RKE2|Kubeadm)ConfigTemplate resources across Helm release upgrades
Today the following happens with Metal3 when nodeReuse is enabled and a CAPI rolling upgrade is triggered by a Helm release upgrade that changed some parameter in Metal3MachineTemplate:
- (bearing in mind that we generate the Metal3MachineTemplate name dynamically, including a hash computed on the content of the resource we'll generate; so if some parameter in Metal3MachineTemplate changes, we'll generate a new Metal3MachineTemplate)
- assume that, before the upgrade we had a Metal3Machine generated from a Metal3MachineTemplate management-cluster-
- we do a Helm release upgrade, with value changes impacting the content of the Metal3MachineTemplate
- the old Metal3MachineTemplate management-cluster- is removed
- a new Metal3MachineTemplate is created (say management-cluster-)
- because the template name has changed, the CAPI ControlPlane controller (or the MachineDeployment controller) triggers a rolling upgrade
- at some point a Machine will be deleted, resulting in deleting the Metal3Machine
- at this point Metal3 will look up the name of the Metal3MachineTemplate that was used to create the Metal3Machine; this is found in the annotations:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Machine
metadata:
  annotations:
    cluster.x-k8s.io/cloned-from-groupkind: Metal3MachineTemplate.infrastructure.cluster.x-k8s.io
    cluster.x-k8s.io/cloned-from-name: management-cluster-cp-<hash1> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
- and then it will try to fetch this resource (to see if Metal3MachineTemplate.spec.nodeReuse is set)
- this will fail because this resource does not exist anymore (Helm deleted it)
- Metal3 falls back to considering that nodeReuse is not enabled (default)
- on Metal3Machine deletion Metal3 needs the Metal3MachineTemplate that was used to create it
- we can't just let Helm delete the Metal3MachineTemplate that it created
- we need something smarter to delete it
⚠ Important edit, @tmmorin 2023-11-29:
We've observed today that the CAPI MachineSet controller also tries to fetch the INFRAMachineTemplate, in order to set the Cluster object as an ownerReference on the machine template (https://github.com/kubernetes-sigs/cluster-api/blob/3cbf341dc32fb29ca1028f4d0fa51969cd3ccb30/internal/controllers/machineset/machineset_controller.go#L1043 and https://github.com/kubernetes-sigs/cluster-api/blob/3cbf341dc32fb29ca1028f4d0fa51969cd3ccb30/internal/controllers/machineset/machineset_controller.go#L237).
So it seems that we need our chart to prevent deletion of all machine templates.
Also, we have the same issue, with the same conclusion, on XXXConfigTemplate resources.
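
One possible direction, as a minimal sketch only: Helm supports the `helm.sh/resource-policy: keep` annotation, which tells Helm to skip deleting a resource when an upgrade, rollback or uninstall would otherwise remove it. The template below is hypothetical: the values keys `clusterName` and `controlPlane.metal3MachineTemplateSpec`, the `-cp-` naming and the hash length are assumptions made for illustration, not the chart's actual code.

```yaml
# Hypothetical chart template (illustrative names, not the actual chart code)
{{- $spec := .Values.controlPlane.metal3MachineTemplateSpec }}
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  # The name includes a hash of the spec, so any change to the spec yields a new resource
  name: {{ .Values.clusterName }}-cp-{{ toYaml $spec | sha256sum | trunc 8 }}
  annotations:
    # Helm leaves this resource in place when it disappears from the rendered manifests
    helm.sh/resource-policy: keep
spec:
  {{- toYaml $spec | nindent 2 }}
```

Resources kept this way are no longer managed by Helm, so old templates would accumulate and would still need a separate cleanup once no Metal3Machine or MachineSet references them. The same annotation could presumably be applied to the XXXConfigTemplate resources.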