Update CAPI to v1.8.3, CABPR (RKE2 bootstrap provider) to 0.7.1
This MR does essentially two things:
- upgrades CAPI to 1.8.x (replacing MR !2710 (closed), initially created by the renovate bot)
- upgrades CABPR (the CAPI bootstrap/cp provider for RKE2)
  - this also resolves the fact that our current main uses a custom build (!2919 (merged), a temporary workaround for #1595 (closed))
Upgrading the two together was necessary because:
- upgrades from 1.1.1 would not work when upgrading CAPI alone
- the new version of the RKE2 bootstrap/cp provider (0.7.1) requires CAPI 1.8.x (to use the new CAPI Machine pre-terminate hook, itself necessary to solve #1595 (closed))
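For context, this pre-terminate hook is just an annotation on the CAPI Machine object: as long as an annotation with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io/` prefix is present, the Machine controller drains the node but holds off tearing down the infrastructure, which gives the control-plane provider a window to remove the node's etcd member first. The sketch below only illustrates this mechanism with `kubectl` driven from Python; the hook name, owner value and function names are made up, this is not code from CABPR or from this MR.

```python
# Illustration only, not code from this MR: CAPI "pre-terminate" deletion hooks are plain
# annotations on the Machine object. While any annotation with this prefix is present, the
# Machine controller waits (after draining the node) before deleting the infrastructure,
# which gives the RKE2 control-plane provider time to remove the node's etcd member.
# The hook name ("rke2-etcd") and owner value below are made up for the example.
import subprocess

HOOK = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/rke2-etcd"

def hold_machine(machine: str, namespace: str) -> None:
    """Set the hook: Machine deletion will pause before infrastructure teardown."""
    subprocess.run(
        ["kubectl", "-n", namespace, "annotate", "machine", machine,
         f"{HOOK}=rke2-cp-provider"],
        check=True,
    )

def release_machine(machine: str, namespace: str) -> None:
    """Remove the hook (trailing '-') once pre-terminate work, e.g. etcd member removal, is done."""
    subprocess.run(
        ["kubectl", "-n", namespace, "annotate", "machine", machine, f"{HOOK}-"],
        check=True,
    )
```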
This MR includes a lot of related changes that were necessary to have everything working:
- kube-job bump to align clusterctl version with CAPI version
- cope with changes in CAPI/CAPD upstream kustomizations:
  - need to use `kustomize build` instead of `kubectl kustomize`, the latter giving an error with the new kustomizations
    - adapt build-kustomize-units-artifact.py and CI linting accordingly
    - fix build-kustomize-units-artifact.py, which was silently producing nothing when the `kubectl kustomize` command used to flatten remote resources was failing (see the sketch after this list)
- a tool to migrate clusters deployed in Sylva 1.1.1 to the new way of managing etcd certificates used by the RKE2 bootstrap/cp provider starting from version 0.3.x (Sylva 1.1.1 had version 0.2.7)
- dependency adjustments around the takeover of MetalLB by Flux:
  - have it happen before the node rolling update
  - ensure metallb-rke2-chart-cleanup (which removes the RKE2 metallb HelmChart resources) comes after cluster-prevent-rke2-helmcharts-calico-metallb (which will prevent recreation of this resource)
- use a custom build of the CAPI RKE2 control plane provider: we hoped that 0.7.1 would solve all the issues we had, but we still run into #1863 (closed) with it, so this MR relies on a custom build that skips etcd membership removal on node teardown (https://github.com/rancher/cluster-api-provider-rke2/blob/main/controlplane/internal/controllers/rke2controlplane_controller.go#L1034-L1036)
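As a side note on the kustomization changes listed above, here is a minimal sketch of the flattening step in the spirit of build-kustomize-units-artifact.py (not the actual script; function and argument names are illustrative): render with `kustomize build` instead of `kubectl kustomize`, and fail loudly instead of silently producing an empty artifact when rendering breaks.

```python
# Hypothetical sketch, not the real build-kustomize-units-artifact.py: render a kustomize unit
# with 'kustomize build' (instead of 'kubectl kustomize') and abort on rendering errors rather
# than silently emitting an empty artifact.
import subprocess
import sys

def flatten_unit(unit_dir: str) -> str:
    """Render a kustomization to a single YAML stream, aborting on any rendering error."""
    result = subprocess.run(
        ["kustomize", "build", unit_dir],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # This is the kind of failure that previously went unnoticed.
        sys.exit(f"kustomize build failed for {unit_dir}:\n{result.stderr}")
    return result.stdout

if __name__ == "__main__":
    sys.stdout.write(flatten_unit(sys.argv[1]))
```

Passing `check=True` to `subprocess.run` would give the same abort-on-failure behavior; the explicit check just produces a clearer error message.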
It also includes things that were helpful to troubleshoot CI runs, in particular:
- more resources covered by debug-on-exit: validation and mutating webhooks, more CAPI secrets, RKE2 HelmCharts.helm.cattle.io.
- an evolution of debug-on-exit that runs data collection commands directly on the remote nodes with `kubectl debug nodes` (depends on sylva-projects/sylva-elements/ci-tooling/ci-deployment-values!128 (merged))
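To make the last point more concrete, here is a minimal sketch of per-node data collection with `kubectl debug node/...`; the container image and the collected commands are placeholders, not what debug-on-exit actually runs.

```python
# Hypothetical sketch of per-node data collection with 'kubectl debug node/...'; the image and
# the commands collected by the real debug-on-exit job are assumptions here.
import json
import subprocess

def node_names() -> list[str]:
    """List node names of the current kubectl context."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [item["metadata"]["name"] for item in json.loads(out)["items"]]

def collect(node: str, command: str) -> str:
    """Run a command on the node's host OS through an ephemeral debug pod chrooted into /host."""
    return subprocess.run(
        ["kubectl", "debug", f"node/{node}", "-i", "--image=busybox:1.36",
         "--", "chroot", "/host", "sh", "-c", command],
        capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    for node in node_names():
        print(f"##### {node}")
        print(collect(node, "uptime; df -h /var/lib/rancher"))
```

Note that `kubectl debug` leaves the per-node debugger pods behind, so a real implementation also needs to clean them up afterwards.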
Issues related to this MR:
- Closes #1595 (closed)
- Related to #1687 (closed) (initially I hoped this MR would close it but see "Currently seen issues" below and see #1863 (closed))
- Improvement for #1857 (closed) (avoids the rke2-server-metrics downtime during the upgrade of this chart due to mismatch of pod labels vs new service pod selector)
- issue #1863 (closed) (worked around in this MR via the use of a custom build)
Related MRs:
- !2958 (merged) for capm3 (ideally the 1.8.x version of CAPM3 should land together with this one to fix the advertised compatibility matrix, and we plan to align before releasing Sylva 1.2, but we can merge independently as long as CI passes)
- !3296 (closed) in which this MR was tested for fresh installs
Watch out 👀
On pipelines testing upgrades from Sylva 1.1.1, on capo, we have observed increased times for the CAPI node rolling update (80min instead of 40min):
- it might be due to an OpenStack infra issue, and this is currently worked around by increasing gitlab CI job timeouts
- if we keep observing this issue, we'll need follow-up work to solve it ...
- update: I noticed some nightly capo jobs where the `cluster` unit takes ~65min to finish (e.g. https://sylva-projects.gitlab.io/-/sylva-core/-/jobs/8352279893/artifacts/apply-management-cluster-timeline.html), so maybe this issue is also present in main or due to OpenStack/network infra behavior?
- We can hope that !3306 (merged) will help (it includes a bugfix that precisely relates to OpenStackMachine deletion being delayed by 10 minutes per Machine)