Update CAPI to v1.8.3, CABPR (RKE2 bootstrap provider) to 0.7.1
This MR does essentially two things:
- upgrades CAPI to 1.8.x (replacing MR !2710 (closed), initially created by the renovate bot)
- upgrades CABPR (the CAPI bootstrap/cp provider for RKE2)
  - this also resolves the fact that our current main uses a custom build (!2919 (merged), a temporary workaround for #1595 (closed))
Upgrading the two together was necessary because:
- upgrades from 1.1.1 would not work when upgrading CAPI alone
- the new version of the RKE2 bootstrap/cp provider (0.7.1) requires CAPI 1.8.x (to use the new CAPI Machine pre-terminate hook, itself necessary to solve #1595 (closed))
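For context, this pre-terminate hook is just an annotation on the CAPI Machine object: as long as an annotation with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io/` prefix is present, the Machine controller drains the node but holds off tearing down the infrastructure, which gives the control-plane provider a window to remove the node's etcd member first. The sketch below only illustrates this mechanism with `kubectl` driven from Python; the hook name, owner value and function names are made up, this is not code from CABPR or from this MR.

```python
# Illustration only, not code from this MR: CAPI "pre-terminate" deletion hooks are plain
# annotations on the Machine object. While any annotation with this prefix is present, the
# Machine controller waits (after draining the node) before deleting the infrastructure,
# which gives the RKE2 control-plane provider time to remove the node's etcd member.
# The hook name ("rke2-etcd") and owner value below are made up for the example.
import subprocess

HOOK = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/rke2-etcd"

def hold_machine(machine: str, namespace: str) -> None:
    """Set the hook: Machine deletion will pause before infrastructure teardown."""
    subprocess.run(
        ["kubectl", "-n", namespace, "annotate", "machine", machine,
         f"{HOOK}=rke2-cp-provider"],
        check=True,
    )

def release_machine(machine: str, namespace: str) -> None:
    """Remove the hook (trailing '-') once pre-terminate work, e.g. etcd member removal, is done."""
    subprocess.run(
        ["kubectl", "-n", namespace, "annotate", "machine", machine, f"{HOOK}-"],
        check=True,
    )
```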
This MR includes a lot of related changes that were necessary to have everything working:
- kube-job bump to align clusterctl version with CAPI version
- cope with changes in CAPI/CAPD upstream kustomizations:
  - need to use `kustomize build` instead of `kubectl kustomize`, the latter giving an error with the new kustomizations
    - adapt build-kustomize-units-artifact.py and CI linting accordingly
    - fix build-kustomize-units-artifact.py, which was silently producing nothing when the `kubectl kustomize` command used to flatten remote resources was failing (see the sketch after this list)
- a tool to migrate clusters deployed in Sylva 1.1.1 to the new way of managing etcd certificates used by the RKE2 bootstrap/cp provider starting from version 0.3.x (Sylva 1.1.1 had version 0.2.7)
- dependency adjustments around the takeover of MetalLB by Flux:
  - have it happen before the node rolling update
  - ensure metallb-rke2-chart-cleanup (which removes the RKE2 metallb HelmChart resources) comes after cluster-prevent-rke2-helmcharts-calico-metallb (which will prevent recreation of this resource)
- use a custom build of the CAPI RKE2 control plane provider: we hoped that 0.7.1 would solve all the issues we had, but we still run into #1863 (closed) with it, so this MR relies on a custom build that skips etcd membership removal on node teardown (https://github.com/rancher/cluster-api-provider-rke2/blob/main/controlplane/internal/controllers/rke2controlplane_controller.go#L1034-L1036)
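As a side note on the kustomization changes listed above, here is a minimal sketch of the flattening step in the spirit of build-kustomize-units-artifact.py (not the actual script; function and argument names are illustrative): render with `kustomize build` instead of `kubectl kustomize`, and fail loudly instead of silently producing an empty artifact when rendering breaks.

```python
# Hypothetical sketch, not the real build-kustomize-units-artifact.py: render a kustomize unit
# with 'kustomize build' (instead of 'kubectl kustomize') and abort on rendering errors rather
# than silently emitting an empty artifact.
import subprocess
import sys

def flatten_unit(unit_dir: str) -> str:
    """Render a kustomization to a single YAML stream, aborting on any rendering error."""
    result = subprocess.run(
        ["kustomize", "build", unit_dir],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # This is the kind of failure that previously went unnoticed.
        sys.exit(f"kustomize build failed for {unit_dir}:\n{result.stderr}")
    return result.stdout

if __name__ == "__main__":
    sys.stdout.write(flatten_unit(sys.argv[1]))
```

Passing `check=True` to `subprocess.run` would give the same abort-on-failure behavior; the explicit check just produces a clearer error message.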
It also includes things that were helpful to troubleshoot CI runs, in particular:
- more resources covered by debug-on-exit: validation and mutating webhooks, more CAPI secrets, RKE2 HelmCharts.helm.cattle.io.
- an evolution of debug-on-exit that runs data collection commands directly on the remote nodes with `kubectl debug nodes` (depends on sylva-projects/sylva-elements/ci-tooling/ci-deployment-values!128 (merged))
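To make the last point more concrete, here is a minimal sketch of per-node data collection with `kubectl debug node/...`; the container image and the collected commands are placeholders, not what debug-on-exit actually runs.

```python
# Hypothetical sketch of per-node data collection with 'kubectl debug node/...'; the image and
# the commands collected by the real debug-on-exit job are assumptions here.
import json
import subprocess

def node_names() -> list[str]:
    """List node names of the current kubectl context."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [item["metadata"]["name"] for item in json.loads(out)["items"]]

def collect(node: str, command: str) -> str:
    """Run a command on the node's host OS through an ephemeral debug pod chrooted into /host."""
    return subprocess.run(
        ["kubectl", "debug", f"node/{node}", "-i", "--image=busybox:1.36",
         "--", "chroot", "/host", "sh", "-c", command],
        capture_output=True, text=True,
    ).stdout

if __name__ == "__main__":
    for node in node_names():
        print(f"##### {node}")
        print(collect(node, "uptime; df -h /var/lib/rancher"))
```

Note that `kubectl debug` leaves the per-node debugger pods behind, so a real implementation also needs to clean them up afterwards.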
Issues related to this MR:
- Closes #1595 (closed)
- Related to #1687 (closed) (initially I hoped this MR would close it but see "Currently seen issues" below and see #1863 (closed))
- Improvement for #1857 (closed) (avoids the rke2-server-metrics downtime during the upgrade of this chart due to mismatch of pod labels vs new service pod selector)
- issue #1863 (closed) (worked around in this MR via the use of a custom build)
Related MRs:
- !2958 (merged) for capm3 (ideally the 1.8.x version of CAPM3 should land together with this one to fix the advertised compatibility matrix, and we plan to align before releasing Sylva 1.2, but we can merge independently as long as CI passes)
- !3296 (closed) in which this MR was tested for fresh installs
Watch out 👀
On pipelines testing upgrades from Sylva 1.1.1, on capo, we have observed increased times for the CAPI node rolling update (80min instead of 40min):
- it might be due to an OpenStack infra issue, and this is currently worked around by increasing gitlab CI job timeouts
- if we keep observing this issue, we'll need follow-up work to solve it ...
- update: I noticed some nightly capo jobs where the `cluster` unit takes ~65min to finish (e.g. https://sylva-projects.gitlab.io/-/sylva-core/-/jobs/8352279893/artifacts/apply-management-cluster-timeline.html), so maybe this issue is also present in main or due to OpenStack/network infra behavior?
- We can hope that !3306 (merged) will help (it includes a bugfix that precisely relates to OpenStackMachine deletion being delayed by 10 minutes per Machine)