RKE2ControlPlane update blocked by Kyverno failure

From the RKE2 CAPI provider logs:

E1009 14:01:29.731911       1 controller.go:329] "Reconciler error" err="
failed to add finalizer:
  failed to patch RKE2ControlPlane rke2-capm3-virt/wc-1488324504-rke2-capm3-virt-control-plane:
  admission webhook \"validate.kyverno.svc-fail\" denied the request:
  resource rke2controlplanes not found in group controlplane.cluster.x-k8s.io/v1beta1"
 controller="rke2controlplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="RKE2ControlPlane" RKE2ControlPlane="rke2-capm3-virt/wc-1488324504-rke2-capm3-virt-control-plane" namespace="rke2-capm3-virt" name="wc-1488324504-rke2-capm3-virt-control-plane" reconcileID="ca906958-4956-4fdb-8971-f24b61ea6c72"

As a result, the RKE2ControlPlane update is stuck (the CP nodes remain on 1.28):

NAME                                              STATUS   ROLES                       AGE   VERSION
mgmt-1488324504-rke2-capm3-virt-management-cp-0   Ready    control-plane,etcd,master   89m   v1.28.8+rke2r1
mgmt-1488324504-rke2-capm3-virt-management-cp-1   Ready    control-plane,etcd,master   95m   v1.28.8+rke2r1
mgmt-1488324504-rke2-capm3-virt-management-cp-2   Ready    control-plane,etcd,master   84m   v1.28.8+rke2r1 
mgmt-1488324504-rke2-capm3-virt-management-md-0   Ready    <none>                      10m   v1.29.8+rke2r1

The status of the RKE2ControlPlane resource is inconsistent (the Ready condition is False, but status.ready is true and readyReplicas==replicas):

  status:
    availableServerIPs:
    - 192.168.100.2
    conditions:
    - lastTransitionTime: "2024-10-09T14:13:59Z"
      message: Rolling 3 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      type: Ready
    - lastTransitionTime: "2024-10-09T13:10:05Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2024-10-09T13:10:05Z"
      status: "True"
      type: CertificatesAvailable
    - lastTransitionTime: "2024-10-09T13:10:05Z"
      status: "True"
      type: ControlPlaneComponentsHealthy
    - lastTransitionTime: "2024-10-09T13:10:06Z"
      status: "True"
      type: MachinesReady
    - lastTransitionTime: "2024-10-09T14:13:29Z"
      message: Rolling 3 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"
      type: MachinesSpecUpToDate
    - lastTransitionTime: "2024-10-09T13:10:06Z"
      status: "True"
      type: Resized
    initialized: true
    observedGeneration: 2
    ready: true  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    readyReplicas: 3   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    replicas: 3  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
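
To make this mismatch easier to spot, here is a minimal sketch of a consistency check over the status shown above. It assumes the status has already been loaded into a Python dict (for example from the JSON output of kubectl); the function name find_status_inconsistencies and the trimmed sample data are illustrative only and are not part of the RKE2 provider or of sylva-core.

```python
from typing import Any


def find_status_inconsistencies(status: dict[str, Any]) -> list[str]:
    """Return mismatches between the summary fields and the Ready condition."""
    # Index the conditions by type so the Ready condition is easy to look up.
    conditions = {c["type"]: c for c in status.get("conditions", [])}
    ready_condition = conditions.get("Ready", {})
    condition_is_true = ready_condition.get("status") == "True"

    problems = []
    if status.get("ready") is True and not condition_is_true:
        reason = ready_condition.get("reason", "unknown")
        problems.append(
            f"status.ready is true but the Ready condition is not True (reason: {reason})"
        )
    replicas = status.get("replicas")
    if replicas is not None and status.get("readyReplicas") == replicas and not condition_is_true:
        problems.append(
            f"readyReplicas == replicas ({replicas}) but the Ready condition is not True"
        )
    return problems


# Trimmed copy of the status shown above (hypothetical in-memory form).
observed_status = {
    "conditions": [
        {"type": "Ready", "status": "False", "reason": "RollingUpdateInProgress"},
        {"type": "Available", "status": "True"},
        {"type": "MachinesSpecUpToDate", "status": "False", "reason": "RollingUpdateInProgress"},
    ],
    "initialized": True,
    "ready": True,
    "readyReplicas": 3,
    "replicas": 3,
}

if __name__ == "__main__":
    for problem in find_status_inconsistencies(observed_status):
        print(problem)
```

A check like this only reports the disagreement; it does not say which of the fields is the one to trust.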

The cause seems to be the following:

  • (Kyverno is updated first)
  • cabpr is updated, introducing a new apiVersion
  • the cluster unit is updated, triggering activity in the RKE2 provider controller
  • ... but Kyverno's Group-Version-Kind cache isn't yet up to date for the new apiVersion, so it raises the error above
  • the cluster-machines-ready

This occurred in a run on MR !2959 (merged) (https://gitlab.com/sylva-projects/sylva-core/-/jobs/8033477205), so I initially suspected that the issue was due to the newer RKE2 provider 0.7.x. However, I also noticed the same symptom on a pipeline upgrading from 1.1.1 to main (https://gitlab.com/sylva-projects/sylva-core/-/jobs/8066308435):

NAME                                            STATUS     ROLES                       AGE   VERSION
mgmt-1492463253-rke2-capo-cp-c0beec52cd-5slxn   NotReady   <none>                      11s   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-k297x   Ready      control-plane,etcd,master   79m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-pqxtd   Ready      control-plane,etcd,master   85m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-ztd94   Ready      control-plane,etcd,master   76m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-4gb6f       Ready      <none>                      17m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-8vw2s       Ready      <none>                      21m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-mn4sn       Ready      <none>                      20m   v1.29.8+rke2r1
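
The listing above can be checked mechanically for control-plane nodes that are still on the old version. The following is only a sketch: it parses the plain-text node listing, the helper name outdated_control_plane_nodes is hypothetical, and the target minor version 1.29 is an assumption inferred from the worker nodes already running v1.29.8+rke2r1. It is not how cluster-machines-ready currently works.

```python
def outdated_control_plane_nodes(node_listing: str, expected_minor: str) -> list[str]:
    """Return names of control-plane nodes whose version is not on expected_minor."""
    outdated = []
    # Skip the header line (NAME STATUS ROLES AGE VERSION).
    for line in node_listing.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore malformed or empty lines
        name, _status, roles, _age, version = fields[:5]
        is_control_plane = "control-plane" in roles.split(",")
        if is_control_plane and not version.startswith(f"v{expected_minor}."):
            outdated.append(name)
    return outdated


# Node listing copied from the pipeline output above.
NODE_LISTING = """
NAME                                            STATUS     ROLES                       AGE   VERSION
mgmt-1492463253-rke2-capo-cp-c0beec52cd-5slxn   NotReady   <none>                      11s   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-k297x   Ready      control-plane,etcd,master   79m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-pqxtd   Ready      control-plane,etcd,master   85m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-cp-f07ae118f4-ztd94   Ready      control-plane,etcd,master   76m   v1.28.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-4gb6f       Ready      <none>                      17m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-8vw2s       Ready      <none>                      21m   v1.29.8+rke2r1
mgmt-1492463253-rke2-capo-md0-2pb9f-mn4sn       Ready      <none>                      20m   v1.29.8+rke2r1
"""

if __name__ == "__main__":
    # Assumed target: Kubernetes 1.29 (inferred from the md0 nodes).
    for node in outdated_control_plane_nodes(NODE_LISTING, "1.29"):
        print(node)
```

Note that the new node (…-cp-c0beec52cd-5slxn) is not counted here because it has no control-plane role yet at the time of the listing; the check only flags nodes that carry the role and are still on the old minor version.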

Conclusions:

  • some results from "1.1.1 to main" pipelines may have misled us into believing that the CP update was done correctly
  • we need to improve cluster-machines-ready
  • we need to fix this Kyverno issue (in progress in !3078 (merged))
Edited Oct 14, 2024 by Thomas Morin