capm3 - unwanted node rolling update on workload clusters during mgmt cluster update

Under certain conditions (detailed below), an unwanted node rolling update is triggered on capm3 workload clusters when the OS images served by the mgmt cluster are changed.

This issue is common to Sylva 1.3 and Sylva "pre-1.4" main.

A typical example, in summary:

  • workload clusters use a specific image key (e.g. ubuntu-noble-hardened-rke2-1-30-9) or an OS image selector
  • the mgmt cluster is updated and, for the same image key, a newer image is now provided (no change of Kubernetes patch version, only base OS changes)

Looking more closely, the conditions are not exactly the same for 1.3 and main:

  • what is common:

    • an "image key" or "OS image selector" do not directly indicate which OS image is deployed (they do not include diskimage builder information), it's only a "loose" indication
    • information about which OS images are served by the mgmt cluster is taken into account to determine the exact image to use (it's SHA sum) from the "loose" indication formulated by the image key or OS image selector
  • in Sylva 1.3 the problematic scenario is the following:

    • a workload cluster has image_key: foo
    • the mgmt cluster is updated and the OS image for foo still exists in the newer settings, but with different content
      • this can arise, for instance, if the diskimage-builder version was incremented to incorporate base OS changes without a change of Kubernetes version
      • if the Kubernetes version changes, then, given how we typically choose image keys (e.g. ubuntu-jammy-plain-rke2-1-29-13), the image key will differ and this scenario does not apply
    • when the os-images-info unit of the mgmt cluster's sylva-units Helm release is reconciled, it will produce an updated os-images-info ConfigMap
    • this ConfigMap is immediately copied into workload cluster namespaces (by a Kyverno policy, see the sketch after this list) under the name kyverno-cloned-os-images-info-capm3
    • this kyverno-cloned-os-images-info-capm3 ConfigMap is used as input to the cluster unit (via valuesFrom on the sylva-capi-cluster HelmRelease)
    • at the next periodic reconciliation of the cluster HelmRelease, the new content of kyverno-cloned-os-images-info-capm3 will be used, resulting in the creation of new Metal3MachineTemplates pointing to the new image (the URL is typically unchanged, but the SHA256 sum changes); see the sketches after this list
    • this will trigger a node rolling update
  • in Sylva main the problematic scenario is:

    • a workload cluster uses a given OS image selector
    • it matches a given image X
    • the mgmt cluster is updated and, similarly to the above, either a newer image is provided for key X (e.g. X was ubuntu-noble-hardened-rke2-1-31-5, and after the update the image for ubuntu-noble-hardened-rke2-1-31-5 is still there but includes base OS updates), or a different OS image Y now matches the OS image selector (I don't think we have a practical case that would do this today, but given the flexibility of the OS image selector framework, it could occur)
    • the new image X and/or the new image Y will be served by os-image-server
    • os-image-server will produce an updated capm3-os-image-server-os-images-info ConfigMap (in the os-image-server namespace)
    • this ConfigMap will be cloned as kyverno-cloned-os-images-info-capm3 ConfigMap in all workload cluster namespaces
    • at the next reconciliation of the cluster HelmRelease, the new information will be used, triggering an update of the Metal3MachineTemplate (with either a different image Y, whose URL and checksum are completely different, or a newer image X with a new SHA sum and a URL where only the embedded sylva diskimage-builder version changes)
    • a node rolling update is triggered
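
To make the propagation path concrete, here is a minimal sketch of what the cloned ConfigMap could roughly contain; the namespace, data key and field layout are assumptions for illustration, not the actual Sylva schema.

```yaml
# Hypothetical content of the ConfigMap cloned into a workload cluster namespace.
# The data key and the layout under it are assumptions for illustration only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kyverno-cloned-os-images-info-capm3
  namespace: my-workload-cluster          # one copy per workload cluster namespace (hypothetical name)
data:
  values.yaml: |
    os_images:
      ubuntu-noble-hardened-rke2-1-30-9:
        uri: https://os-image-server.sylva/ubuntu-noble-hardened-rke2-1-30-9.qcow2   # assumed URL
        sha256: "0a1b2c3d..."             # changes whenever the image is rebuilt, even if the key stays the same
```

When the mgmt cluster update rebuilds the image behind the same key, only the checksum (and possibly the URL) changes, but that is enough to alter the values consumed by the cluster chart.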
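
The cloning step is done by a Kyverno policy; a minimal generate/clone rule of that kind could look as follows (policy name, match criteria and source namespace are assumptions, not the actual Sylva policy; in 1.3 the source is the os-images-info ConfigMap produced by sylva-units, in main it is capm3-os-image-server-os-images-info).

```yaml
# Minimal sketch of a Kyverno generate/clone rule of the kind described above.
# Policy name, match criteria and source namespace are assumptions for illustration.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: clone-os-images-info-capm3
spec:
  rules:
    - name: clone-os-images-info
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: v1
        kind: ConfigMap
        name: kyverno-cloned-os-images-info-capm3
        namespace: "{{request.object.metadata.name}}"
        synchronize: true                      # keeps the clones in sync, so mgmt-side changes propagate
        clone:
          namespace: os-image-server           # assumed source namespace (main scenario)
          name: capm3-os-image-server-os-images-info
```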
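
On the consuming side, the cluster HelmRelease reads this ConfigMap through valuesFrom, and the chart renders Metal3MachineTemplates that embed the resolved URL and checksum; since new templates are created whenever those values change, Cluster API then rolls the nodes. A sketch with assumed resource names and values:

```yaml
# Sketch of the consuming side; release name, valuesKey and image values are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: sylva-capi-cluster
  namespace: my-workload-cluster
spec:
  # ... chart, interval, inline values ...
  valuesFrom:
    - kind: ConfigMap
      name: kyverno-cloned-os-images-info-capm3
      valuesKey: values.yaml                   # assumed key
---
# Rendered by the chart: the image stanza that ends up in a Metal3MachineTemplate.
# A change of checksum (or URL) yields a new template and thus a node rolling update.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: my-workload-cluster-cp-abc123          # a new template is created when the image values change
  namespace: my-workload-cluster
spec:
  template:
    spec:
      image:
        url: https://os-image-server.sylva/ubuntu-noble-hardened-rke2-1-30-9.qcow2
        checksum: "0a1b2c3d..."                # the SHA256 that changed after the mgmt cluster update
        checksumType: sha256
        format: qcow2
```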

This issue is a question of lifecycle "coupling" between mgmt and workload clusters.

The solution seems to be to decouple two things:

  • producing os-images-info for the relevant images based only on workload cluster data (similarly to what we do for OpenStack)
  • having, in the workload cluster context, information about which OS images are served by the mgmt cluster, unambiguously identified by their SHA sums, since the "image key" alone is ambiguous (a possible shape is sketched below)
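
Purely as an illustration of what that decoupling could look like (all field names here are assumptions, not an agreed design): the workload cluster values would carry an unambiguous image reference resolved from workload cluster data, so that a change on the mgmt side no longer flows straight into the cluster HelmRelease.

```yaml
# Illustrative only: hypothetical workload cluster values pinning the image unambiguously.
# None of these field names exist today; they only sketch the idea of decoupling.
cluster:
  machine_image:
    key: ubuntu-noble-hardened-rke2-1-30-9     # the "loose" indication (image key or selector result)
    resolved:                                  # resolved once from workload cluster data, changed deliberately
      url: https://os-image-server.sylva/ubuntu-noble-hardened-rke2-1-30-9.qcow2
      sha256: "0a1b2c3d..."
```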

🗒️ a practical workaround for this problem is to pause the workload clusters during the mgmt cluster update (see the sketch below)
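
Depending on how the workload cluster is driven, "pausing" can mean suspending the Flux object(s) that reconcile its definition and/or pausing Cluster API reconciliation for the cluster; a hedged example of both (resource names are assumptions):

```yaml
# Two possible ways to pause a workload cluster during the mgmt cluster update.
# Resource names are assumptions; apply this to whichever objects drive the cluster in your setup.

# 1) Suspend the HelmRelease that renders the cluster definition:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: sylva-capi-cluster
  namespace: my-workload-cluster
spec:
  suspend: true
---
# 2) And/or pause Cluster API reconciliation for the cluster itself:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-workload-cluster
  namespace: my-workload-cluster
spec:
  paused: true
```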

/cc @feleouet @cristian.manda @mihai.zaharia @mederic.deverdilhac @rletrocquer
