RKE2 1.27 to 1.28 upgrade blocked: RKE2 tries to update namespace annotations during startup and fails
(troubleshot jointly with @feleouet)
Details at https://github.com/rancher/rke2/issues/5693
Short summary:
- (we already knew that, when a new node starts, RKE2 cannot perform API operations that trigger webhooks: its local API server cannot reach pods, because kube-proxy is only started at the end of initialization)
- unfortunately, RKE2 installs network policies and then tries to annotate the namespaces it has covered; this annotation triggers the webhooks we have in Sylva on Namespace resources (Rancher and Kyverno built-in webhooks), so it fails
- this is a problem only for RKE2 upgrades that need to add new network policies, which is the case for 1.27 to 1.28
Implications for Sylva: this prevents upgrading an RKE2 cluster from 1.27 to 1.28.
I found this workaround: if we set a few annotations on the kube-system namespace, the code that breaks is not run.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  annotations:
    # these are already present on 1.27, nothing to do
    np.rke2.io: resolved
    np.rke2.io/dns: resolved
    np.rke2.io/ingress: resolved
    # these 3 would be added by RKE2 1.28; if we add them manually, the problematic code will not run
    np.rke2.io/ingress-webhook: resolved
    np.rke2.io/metrics-server: resolved
    np.rke2.io/snapshot-validation-webhook: resolved
```
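Why these annotations help can be shown with a simplified model. It assumes that RKE2 treats a marker annotation set to `resolved` on kube-system as "policy already handled" and skips both the install and the annotation call for it. That assumption is inferred from the workaround, not read from RKE2's source. The names `RKE2_127_KEYS`, `RKE2_128_NEW_KEYS` and `pending_annotations` are hypothetical and exist only for this sketch.

```python
# Simplified, hypothetical model of the behaviour described above; this is NOT
# RKE2's actual code. Assumption: a policy whose marker annotation on kube-system
# is already "resolved" is skipped; any other one is installed and the namespace
# is then annotated, which is the API call that hits the Namespace webhooks.
from typing import Dict, List

# Marker annotations taken from the kube-system manifest above.
RKE2_127_KEYS = [
    "np.rke2.io",
    "np.rke2.io/dns",
    "np.rke2.io/ingress",
]
RKE2_128_NEW_KEYS = [
    "np.rke2.io/ingress-webhook",
    "np.rke2.io/metrics-server",
    "np.rke2.io/snapshot-validation-webhook",
]


def pending_annotations(annotations: Dict[str, str], keys: List[str]) -> List[str]:
    """Return the keys for which the modelled logic would still install a policy
    and annotate the namespace, i.e. those not yet marked "resolved"."""
    return [key for key in keys if annotations.get(key) != "resolved"]


if __name__ == "__main__":
    all_keys = RKE2_127_KEYS + RKE2_128_NEW_KEYS
    # kube-system as left by RKE2 1.27: only the 1.27 markers are present,
    # so the three 1.28 policies would trigger an annotation (and the webhooks).
    after_127 = {key: "resolved" for key in RKE2_127_KEYS}
    print(pending_annotations(after_127, all_keys))
    # With the workaround applied, nothing is left to annotate.
    with_workaround = {key: "resolved" for key in all_keys}
    print(pending_annotations(with_workaround, all_keys))
```

In this model, the first call lists the three annotations added by 1.28, and the second returns an empty list, which is the state the workaround aims for before the upgraded node starts.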
One implication of doing this is that we would need something else to install these network policies, since RKE2 would then skip them.