Cluster garbage collector deletes Metal3DataTemplates too eagerly
Summary
The cluster garbage collector script deletes Metal3DataTemplates that have been created but are not yet in use if it runs before the corresponding Metal3Machine is actually deployed.
This leaves the rolling upgrade completely stuck, without any apparent error.
Steps to reproduce
- Deploy a workload cluster using capm3
- Trigger a rolling update on the workload cluster (by changing something that ends up in the Metal3DataTemplate, like adding a VLAN to an interface)
- Run the cluster garbage collector manually (see the sketch after this list)
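One way to run the garbage collector manually is to spawn a one-off Job from its CronJob. This is only a sketch: the CronJob name and namespace below are assumptions, not taken from this report.
# Create a one-off Job from the garbage collector CronJob (names are assumed, adjust to your deployment)
kubectl create job gc-manual-run --from=cronjob/cluster-garbage-collector -n wl-cluster
# Follow its logs to see what it deletes
kubectl logs job/gc-manual-run -n wl-cluster -f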
What is the current bug behavior?
When triggering the rolling update, the following Metal3DataTemplates are created:
helm get all cluster -n wl-cluster | yq 'select(.kind == "Metal3DataTemplate") .metadata.name'
wl-cluster-cp-metadata-2f84eaa9d0
---
wl-cluster-md-metadata-md0-317dbafee4
We can see that they actually exist on the cluster
kubectl get -n wl-cluster m3dt
NAME                                    CLUSTER   AGE
wl-cluster-cp-metadata-2f84eaa9d0                 3d21h
wl-cluster-cp-metadata-3bf3264163                 6m34s
wl-cluster-md-metadata-md0-317dbafee4             4m2s    <- new md
wl-cluster-md-metadata-md0-f0515c9c9e             52m
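At this point the new templates are not yet referenced by any Metal3Data object, because the corresponding Metal3Machines have not been rolled out yet, which is presumably why the garbage collector considers them unused. A quick way to check which templates are actually referenced (a sketch; it assumes the m3d short name for Metal3Data and the spec.template.name field path, by analogy with the m3dt short name used above):
# List the Metal3DataTemplate referenced by each existing Metal3Data (short name and field path assumed)
kubectl get m3d -n wl-cluster -o yaml | yq '.items[].spec.template.name'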
After running the cronjob, the new Metal3DataTemplates are gone:
NAME                                    CLUSTER   AGE
wl-cluster-cp-metadata-2f84eaa9d0                 3d21h
wl-cluster-md-metadata-md0-f0515c9c9e             61m
What is the expected correct behavior?
the "latest" Metal3DataTemplate should be be deleted since they're used by the current cluster definition.
Relevant logs and/or screenshots
Possible fixes
- Revert !4514 (merged) (short term)
- Rework !4514 (merged) so that we either check whether the object belongs to the latest helm release revision or find another criterion (a rough sketch follows this list)
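A rough sketch of the second option, reusing the commands already shown above. The release name cluster and the namespace wl-cluster come from this report; the structure of the garbage collector script itself is an assumption, so this only illustrates the filtering step.
# Metal3DataTemplates rendered by the latest helm release revision: these must never be collected
keep=$(helm get all cluster -n wl-cluster | yq 'select(.kind == "Metal3DataTemplate") .metadata.name')
# Only treat live Metal3DataTemplates that are absent from the latest release as deletion candidates
for name in $(kubectl get m3dt -n wl-cluster -o yaml | yq '.items[].metadata.name'); do
  if ! grep -qx "$name" <<< "$keep"; then
    echo "candidate for deletion: $name"  # the real script would still apply its other criteria here
  fi
done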