upgrade issues around RKE2 k8s metrics server

I observed the following while working on !3090 (closed), in job https://gitlab.com/sylva-projects/sylva-core/-/jobs/8340196863:

  • workload cluster namespace deletion was stuck on:
NAME              STATUS        AGE
rke2-capm3-virt   Terminating   137m
Namespace 'rke2-capm3-virt' deletion did not complete (.status below)
conditions:
  - lastTransitionTime: "2024-11-13T00:21:10Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  • this is due to the kube-system/rke2-metrics-server Service having no endpoints

  • this is because the selector of the Service does not match the pod labels (commands to confirm this are sketched after this list):

    • selector is:

      selector:
        app: rke2-metrics-server
        app.kubernetes.io/instance: rke2-metrics-server
        app.kubernetes.io/name: rke2-metrics-server

      the selector was last updated at 2024-11-12T23:06:26Z (rke2-metrics-server chart version rke2-metrics-server-3.12.003)

    • labels of the only rke2-metrics-server pod:

      labels:
        app: rke2-metrics-server
        pod-template-hash: 544c8c66fc
        release: rke2-metrics-server

      the app.kubernetes.io/instance and app.kubernetes.io/name labels are missing, so the selector cannot match this pod

      this pod is from the previous version of rke2-metrics-server (the Deployment still carries the chart: rke2-metrics-server-2.11.100-build2023051513 label)

  • I observed that the upgrade of rke2-metrics-server had failed -- in the logs of the kube-system/helm-install-rke2-metrics-server-276wj pod (inspection commands are sketched after this list):

Release status is 'failed' and failure policy is 'abort', not 'reinstall'; waiting for operator intervention
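
For reference, a minimal diagnostic sketch of the checks behind the observations above (assuming a kubeconfig pointing at the affected RKE2 workload cluster; object names are taken from this job):

# why the namespace deletion is stuck: discovery failure on the aggregated metrics API
kubectl get namespace rke2-capm3-virt -o jsonpath='{.status.conditions}'
kubectl get apiservice v1beta1.metrics.k8s.io

# the Service backing the aggregated API has no endpoints because no pod matches its selector
kubectl -n kube-system get endpoints rke2-metrics-server
kubectl -n kube-system get service rke2-metrics-server -o jsonpath='{.spec.selector}'
kubectl -n kube-system get pods -l app=rke2-metrics-server --show-labels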
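
To dig into the failed release itself, the usual inspection points would be the following (a sketch only; the HelmChart object name is an assumption derived from the release name, and I have not checked what the recommended recovery path is for RKE2's helm-controller):

# history and current status of the Helm release managed by helm-controller
helm -n kube-system history rke2-metrics-server
helm -n kube-system status rke2-metrics-server

# the HelmChart custom resource driving the install job, and the job logs quoted above
kubectl -n kube-system get helmchart rke2-metrics-server -o yaml
kubectl -n kube-system logs job/helm-install-rke2-metrics-server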

Summary:

  • the Helm upgrade failed (I don't know the reason)
  • the rke2-metrics-server Service was updated with a new pod selector, using labels that pods of the previous release do not have
  • the rke2-metrics-server Deployment was not updated, and hence the rke2-metrics-server pods still don't have the new labels

Possibilities for resolution:

  • understand why the Helm release status is failed
  • an upstream fix (the selector change could and should be done in a way that does not disrupt the Service)
  • something to live-adjust the pod labels, or the Service's pod selector, to allow a smoother transition (e.g. a Kyverno policy; a rough sketch follows this list)
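
On the last point, a rough and untested sketch of what such a transition helper could look like as a Kyverno "mutate existing" policy; the policy name is made up, the label set is copied from the new selector above, and the exact fields should be double-checked against the Kyverno version deployed in the cluster:

# sketch only, not tested: add the labels expected by the new Service selector
# to the existing rke2-metrics-server pods so the Service regains endpoints
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: rke2-metrics-server-selector-transition   # made-up name
spec:
  mutateExistingOnPolicyUpdate: true   # also patch pods that already exist
  rules:
    - name: add-new-selector-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - kube-system
              selector:
                matchLabels:
                  app: rke2-metrics-server
      mutate:
        targets:
          - apiVersion: v1
            kind: Pod
            namespace: kube-system
        patchStrategicMerge:
          metadata:
            labels:
              app.kubernetes.io/instance: rke2-metrics-server
              app.kubernetes.io/name: rke2-metrics-server

A one-shot manual equivalent would be to label the existing pods directly, e.g. kubectl -n kube-system label pod -l app=rke2-metrics-server app.kubernetes.io/instance=rke2-metrics-server app.kubernetes.io/name=rke2-metrics-server, until the upgrade goes through cleanly.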