Cluster garbage collector deletes Metal3DataTemplates too eagerly

Summary

The cluster garbage collector script deletes Metal3DataTemplates that have been created but are not yet in use if it runs before the corresponding Metal3Machine is actually deployed.

This leaves the rolling upgrade completely stuck, without any apparent error.

Steps to reproduce

  • Deploy a workload cluster using capm3
  • Trigger a rolling update on the workload cluster (by changing something that ends up in the Metal3DataTemplate, such as adding a VLAN to an interface)
  • Run the cluster garbage collector manually (see the sketch after this list)
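
For the last step, one way to trigger the garbage collector on demand is to spawn a one-off Job from its CronJob. This is only a sketch: the CronJob name (cluster-garbage-collector) and its namespace (gc-system) are placeholders, not the real names from our deployment.

# Create a one-shot Job from the garbage-collector CronJob and follow its logs
kubectl create job gc-manual-run --from=cronjob/cluster-garbage-collector -n gc-system
kubectl logs -n gc-system job/gc-manual-run -f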

What is the current bug behavior?

When the rolling update is triggered, the following Metal3DataTemplates are created:

helm get all cluster -n wl-cluster | yq 'select(.kind == "Metal3DataTemplate") .metadata.name'
wl-cluster-cp-metadata-2f84eaa9d0
---
wl-cluster-md-metadata-md0-317dbafee4

We can see that they actually exist on the cluster:

kubectl get -n wl-cluster m3dt 
NAME                                    CLUSTER   AGE
wl-cluster-cp-metadata-2f84eaa9d0                 3d21h
wl-cluster-cp-metadata-3bf3264163                 6m34s
wl-cluster-md-metadata-md0-317dbafee4             4m2s <- new md
wl-cluster-md-metadata-md0-f0515c9c9e             52m

After the cronjob runs, the newly created Metal3DataTemplates are gone:

NAME                                    CLUSTER   AGE
wl-cluster-cp-metadata-2f84eaa9d0                 3d21h
wl-cluster-md-metadata-md0-f0515c9c9e             61m

What is the expected correct behavior?

the "latest" Metal3DataTemplate should be be deleted since they're used by the current cluster definition.

Possible fixes

  • Revert !4514 (merged) (short term)
  • Re-work !4514 (merged) so that we either check whether the object belongs to the latest helm-release revision or find another criterion (see the sketch after this list)
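
As a sketch of the second option, the collector could compare what currently exists on the cluster with what the latest helm-release revision actually renders, and only delete Metal3DataTemplates that are no longer referenced. This reuses the helm get / yq pattern from above; the release name (cluster) and namespace (wl-cluster) follow the outputs in this report, and the deletion is left commented out on purpose.

#!/usr/bin/env bash
# Sketch only: keep any Metal3DataTemplate that the latest helm release
# revision still renders, and flag the rest as garbage.
set -euo pipefail

RELEASE=cluster
NAMESPACE=wl-cluster

# Metal3DataTemplates rendered by the latest release revision
rendered=$(helm get manifest "$RELEASE" -n "$NAMESPACE" \
  | yq 'select(.kind == "Metal3DataTemplate") .metadata.name')

# Metal3DataTemplates currently present on the cluster
existing=$(kubectl get m3dt -n "$NAMESPACE" -o name | cut -d/ -f2)

for name in $existing; do
  if grep -qxF "$name" <<<"$rendered"; then
    echo "keep         $name (referenced by the latest release revision)"
  else
    echo "would delete $name (not rendered by the latest release revision)"
    # kubectl delete m3dt -n "$NAMESPACE" "$name"
  fi
done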