upgrade issues around RKE2 k8s metrics server
I observed the following while working on !3090 (closed), in job https://gitlab.com/sylva-projects/sylva-core/-/jobs/8340196863:
- workload cluster namespace deletion was stuck on:

  ```
  NAME              STATUS        AGE
  rke2-capm3-virt   Terminating   137m
  ```

  Namespace 'rke2-capm3-virt' deletion did not complete (`.status` below):

  ```yaml
  conditions:
  - lastTransitionTime: "2024-11-13T00:21:10Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  ```
- this is due to the `kube-system/rke2-metrics-server` Service having no endpoint

- this is because the selector of the Service does not match the pod labels (see the kubectl sketch after this list):

  - the selector is:

    ```yaml
    selector:
      app: rke2-metrics-server
      app.kubernetes.io/instance: rke2-metrics-server
      app.kubernetes.io/name: rke2-metrics-server
    ```

    the selector was last updated at 2024-11-12T23:06:26Z (rke2-metrics-server chart version rke2-metrics-server-3.12.003)

  - the labels of the only rke2-metrics-server pod are:

    ```yaml
    labels:
      app: rke2-metrics-server
      pod-template-hash: 544c8c66fc
      release: rke2-metrics-server
    ```

    the `app.kubernetes.io/instance` and `app.kubernetes.io/name` labels are not here, so the selector can't match; this pod is from the previous version of rke2-metrics-server (the Deployment has the `chart: rke2-metrics-server-2.11.100-build2023051513` label)
- I observed that the upgrade of rke2-metrics-server had failed -- in the logs of the `kube-system/helm-install-rke2-metrics-server-276wj` pod:

  ```
  Release status is 'failed' and failure policy is 'abort', not 'reinstall'; waiting for operator intervention
  ```
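For reference, the empty endpoints and the selector/label mismatch can be confirmed with a few kubectl queries; a minimal sketch, using the resource names observed above:

```shell
# the Service has no endpoints because its selector matches no running pod
kubectl -n kube-system get endpoints rke2-metrics-server

# compare the Service selector ...
kubectl -n kube-system get service rke2-metrics-server -o jsonpath='{.spec.selector}{"\n"}'

# ... with the labels of the pod left over from the previous release
kubectl -n kube-system get pods -l app=rke2-metrics-server --show-labels

# the APIService backed by that Service is what namespace deletion discovery trips over
kubectl get apiservice v1beta1.metrics.k8s.io
```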
Summary:
- the Helm upgrade failed (I don't know the reason; the sketch after this list shows where to look for it)
- the rke2-metrics-server Service was updated with a new pod selector that relies on labels which pods of the previous release do not have
- the rke2-metrics-server Deployment was not updated, and hence the rke2-metrics-server pods still don't have the new labels
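To get at the reason for the failed release, the Helm release history and the helm-install job logs are the places to look; a sketch, assuming the helm CLI is pointed at the workload cluster and that the job is named `helm-install-rke2-metrics-server` as the pod name suggests:

```shell
# release history kept by Helm (release data lives in kube-system Secrets);
# the 'description' column of the failed revision usually carries the error
helm -n kube-system history rke2-metrics-server

# full logs of the helm-controller job pod that performed the upgrade
kubectl -n kube-system logs -l job-name=helm-install-rke2-metrics-server --tail=-1
```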
Possibilities for resolution:
- understand why the Helm release status is `failed`
- upstream fix (the change of selector could and should be done in a way that does not disrupt the service)
- something to live-adjust the pod labels, or the Service pod selector, to allow a smoother transition (Kyverno policy); a rough kubectl sketch of such an adjustment follows below
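To illustrate the "live-adjust" idea (which a Kyverno mutate policy could automate during upgrades), a one-off manual adjustment could look like the sketch below; both commands are hypothetical workarounds based on the selector and labels observed above, not a vetted fix:

```shell
# option 1: add the labels expected by the new Service selector to the pod
# still running from the previous release
kubectl -n kube-system label pod -l app=rke2-metrics-server \
  app.kubernetes.io/instance=rke2-metrics-server \
  app.kubernetes.io/name=rke2-metrics-server

# option 2: drop the two new keys from the Service selector so it falls back
# to the 'app' label that the old pod does carry
# (JSON merge patch semantics: a null value removes the key)
kubectl -n kube-system patch service rke2-metrics-server --type merge \
  -p '{"spec":{"selector":{"app.kubernetes.io/instance":null,"app.kubernetes.io/name":null}}}'
```

Either adjustment gives the Service an endpoint again, so metrics.k8s.io discovery succeeds and the stuck namespace deletion can proceed; it does not address the underlying failed Helm release.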