RKE2ControlPlane update blocked by Kyverno failure
(in RKE2 CAPI provider logs)
E1009 14:01:29.731911 1 controller.go:329] "Reconciler error" err="
failed to add finalizer:
failed to patch RKE2ControlPlane rke2-capm3-virt/wc-1488324504-rke2-capm3-virt-control-plane:
admission webhook \"validate.kyverno.svc-fail\" denied the request:
resource rke2controlplanes not found in group controlplane.cluster.x-k8s.io/v1beta1"
controller="rke2controlplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="RKE2ControlPlane" RKE2ControlPlane="rke2-capm3-virt/wc-1488324504-rke2-capm3-virt-control-plane" namespace="rke2-capm3-virt" name="wc-1488324504-rke2-capm3-virt-control-plane" reconcileID="ca906958-4956-4fdb-8971-f24b61ea6c72"
As a result, the RKE2ControlPlane update is stuck: the CP nodes remain on 1.28 while the MD node is already on 1.29:
NAME                                              STATUS   ROLES                       AGE   VERSION
mgmt-1488324504-rke2-capm3-virt-management-cp-0   Ready    control-plane,etcd,master   89m   v1.28.8+rke2r1
mgmt-1488324504-rke2-capm3-virt-management-cp-1   Ready    control-plane,etcd,master   95m   v1.28.8+rke2r1
mgmt-1488324504-rke2-capm3-virt-management-cp-2   Ready    control-plane,etcd,master   84m   v1.28.8+rke2r1
mgmt-1488324504-rke2-capm3-virt-management-md-0   Ready    <none>                      10m   v1.29.8+rke2r1
The status of the RKE2ControlPlane resource is inconsistent (the Ready condition is False, but status.ready is true and readyReplicas == replicas):
status:
  availableServerIPs:
  - 192.168.100.2
  conditions:
  - lastTransitionTime: "2024-10-09T14:13:59Z"
    message: Rolling 3 replicas with outdated spec (0 replicas up to date)
    reason: RollingUpdateInProgress
    severity: Warning
    status: "False"    <<<<<<<<<<
    type: Ready
  - lastTransitionTime: "2024-10-09T13:10:05Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2024-10-09T13:10:05Z"
    status: "True"
    type: CertificatesAvailable
  - lastTransitionTime: "2024-10-09T13:10:05Z"
    status: "True"
    type: ControlPlaneComponentsHealthy
  - lastTransitionTime: "2024-10-09T13:10:06Z"
    status: "True"
    type: MachinesReady
  - lastTransitionTime: "2024-10-09T14:13:29Z"
    message: Rolling 3 replicas with outdated spec (0 replicas up to date)
    reason: RollingUpdateInProgress
    severity: Warning
    status: "False"
    type: MachinesSpecUpToDate
  - lastTransitionTime: "2024-10-09T13:10:06Z"
    status: "True"
    type: Resized
  initialized: true
  observedGeneration: 2
  ready: true          <<<<<<<<<<
  readyReplicas: 3     <<<<<<<<<<
  replicas: 3          <<<<<<<<<<
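To make this kind of contradiction easier to spot, here is a minimal Python sketch that compares the Ready condition with status.ready and the replica counters. The function name find_status_inconsistencies is hypothetical (it is not part of sylva-core or the RKE2 provider), and the input is assumed to be the status section already loaded as a dict, reduced here to the fields shown above.

```python
# Hypothetical helper, for illustration only: not part of sylva-core
# or of the RKE2 CAPI provider.
def find_status_inconsistencies(status: dict) -> list:
    """Describe contradictions between the Ready condition and the summary fields."""
    conditions = {c["type"]: c for c in status.get("conditions", [])}
    ready_cond = conditions.get("Ready", {})
    ready_cond_true = ready_cond.get("status") == "True"

    problems = []
    if status.get("ready") and not ready_cond_true:
        problems.append(
            "status.ready is true but the Ready condition is "
            f"{ready_cond.get('status', 'missing')} "
            f"(reason: {ready_cond.get('reason', 'none')})"
        )
    replicas = status.get("replicas")
    if (
        replicas is not None
        and status.get("readyReplicas") == replicas
        and not ready_cond_true
    ):
        problems.append(
            f"readyReplicas == replicas ({replicas}) but the Ready condition is not True"
        )
    return problems


# Subset of the status reported in this issue.
sample_status = {
    "conditions": [
        {
            "type": "Ready",
            "status": "False",
            "reason": "RollingUpdateInProgress",
            "severity": "Warning",
        },
        {"type": "MachinesSpecUpToDate", "status": "False"},
    ],
    "initialized": True,
    "observedGeneration": 2,
    "ready": True,
    "readyReplicas": 3,
    "replicas": 3,
}

for problem in find_status_inconsistencies(sample_status):
    print(problem)
```

A check relying only on status.ready or on the replica counters would consider this control plane healthy; looking at the Ready condition as well flags both contradictions.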
The cause seems to be the following:
- (Kyverno is updated first)
- cabpr is updated, introducing a new apiVersion
- the cluster unit is updated, triggering activity in the RKE2 provider controller
- ... but Kyverno's Group-Version-Kind cache is not yet up to date for the new apiVersion, so its webhook rejects the request with the error above
- the cluster-machines-ready
This happened on a run of MR !2959 (merged) (https://gitlab.com/sylva-projects/sylva-core/-/jobs/8033477205), so I initially suspected the issue was due to the newer RKE2 provider 0.7.x, but I also noticed the same symptom on a pipeline upgrading from 1.1.1 to main (https://gitlab.com/sylva-projects/sylva-core/-/jobs/8066308435):
NAME                                            STATUS     ROLES                       AGE   VERSION
mgmt-1492463253-rke2-capo-cp-c0beec52cd-5slxn   NotReady   <none>                      11s   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-k297x   Ready      control-plane,etcd,master   79m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-pqxtd   Ready      control-plane,etcd,master   85m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-ztd94   Ready      control-plane,etcd,master   76m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-4gb6f       Ready      <none>                      17m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-8vw2s       Ready      <none>                      21m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-mn4sn       Ready      <none>                      20m   v1.29.8+rke2r1
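A node listing like the one above can be checked mechanically for control-plane nodes that were left behind. The following is a minimal Python sketch, assuming the default five-column output of "kubectl get nodes"; the function name outdated_control_plane_nodes and the target_minor parameter are hypothetical and only serve the illustration.

```python
import re


# Hypothetical helper, for illustration only.
def outdated_control_plane_nodes(kubectl_output: str, target_minor: str) -> list:
    """Return names of control-plane nodes whose version is not on target_minor."""
    outdated = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, _status, roles, _age, version = fields[:5]
        # Nodes that have just joined may still show <none> as role;
        # they are skipped here because they cannot be classified yet.
        if "control-plane" not in roles.split(","):
            continue
        match = re.match(r"v(\d+\.\d+)\.", version)
        if match is None or match.group(1) != target_minor:
            outdated.append(name)
    return outdated


# Node listing from the "1.1.1 to main" pipeline quoted in this issue.
nodes = """
NAME STATUS ROLES AGE VERSION
mgmt-1492463253-rke2-capo-cp-c0beec52cd-5slxn NotReady <none> 11s v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-k297x Ready control-plane,etcd,master 79m v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-pqxtd Ready control-plane,etcd,master 85m v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-ztd94 Ready control-plane,etcd,master 76m v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-4gb6f Ready <none> 17m v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-8vw2s Ready <none> 21m v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-mn4sn Ready <none> 20m v1.29.8+rke2r1
"""

stale = outdated_control_plane_nodes(nodes, target_minor="1.29")
print(stale)
```

With a target of 1.29, the three control-plane nodes still on v1.28.8 are reported, which is the symptom that went unnoticed in the pipeline results.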
Conclusions:
- some results from "1.1.1 to main" pipelines may have misled us into believing that the CP update had completed correctly
- we need to improve cluster-machines-ready
- we need to fix this Kyverno issue (in progress in !3078 (merged))
Edited by Thomas Morin