etcd crashloop during kubeadm cluster upgrades
Summary
Management cluster upgrades are failing (systematically?), as in this example run: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7240024420

A Machine is stuck in the Deleting phase:
```
v1.27.13
sylva-system mgmt-1356446484-kubeadm-capm3-virt-control-plane-f2x9f mgmt-1356446484-kubeadm-capm3-virt mgmt-1356446484-kubeadm-capm3-virt-management-cp-1 metal3://sylva-system/mgmt-1356446484-kubeadm-capm3-virt-management-cp-1/mgmt-1356446484-kubeadm-capm3-virt-cp-0d04c7cb37-72mj2 Running 54m v1.28.9
sylva-system mgmt-1356446484-kubeadm-capm3-virt-control-plane-n925l mgmt-1356446484-kubeadm-capm3-virt mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 metal3://sylva-system/mgmt-1356446484-kubeadm-capm3-virt-management-cp-2/mgmt-1356446484-kubeadm-capm3-virt-cp-0d04c7cb37-4m6vb Deleting 96m v1.28.9
sylva-system mgmt-1356446484-kubeadm-capm3-virt-control-plane-wjz7p mgmt-1356446484-kubeadm-capm3-virt mgmt-1356446484-kubeadm-capm3-virt-management-cp-0 metal3://sylva-system/mgmt-1356446484-kubeadm-capm3-virt-management-cp-0/mgmt-1356446484-kubeadm-capm3-virt-cp-0d04c7cb37-cjpvb Running 96m v1.28.9
sylva-system mgmt-1356446484-kubeadm-capm3-virt-md0-zt7z9-mz44s mgmt-1356446484-kubeadm-capm3-virt mgmt-1356446484-kubeadm-capm3-virt-management-md-0 metal3://sylva-system/mgmt-1356446484-kubeadm-capm3-virt-management-md-0/mgmt-1356446484-kubeadm-capm3-virt-md-md0-f55a7a9b3e-p7fz5 Running 55m v1.28.9
```
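This state can be re-checked, and the blocked deletion inspected, with something along these lines (a sketch run against the management cluster, reusing the Machine name from this run):

```shell
# List Cluster API Machines in all namespaces, with their phase and version
kubectl get machines -A

# Inspect the Machine stuck in Deleting: deletionTimestamp, finalizers, conditions
kubectl -n sylva-system describe machine mgmt-1356446484-kubeadm-capm3-virt-control-plane-n925l
```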
We can see that etcd (along with kube-apiserver) is in CrashLoopBackOff on that node:
```
kube-system etcd-mgmt-1356446484-kubeadm-capm3-virt-management-cp-0 1/1 Running 0 115m 192.168.100.23 mgmt-1356446484-kubeadm-capm3-virt-management-cp-0 <none> <none>
kube-system etcd-mgmt-1356446484-kubeadm-capm3-virt-management-cp-1 1/1 Running 0 51m 192.168.100.22 mgmt-1356446484-kubeadm-capm3-virt-management-cp-1 <none> <none>
kube-system etcd-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 0/1 CrashLoopBackOff 14 (5m2s ago) 122m 192.168.100.20 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 <none> <none>
kube-system kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-0 1/1 Running 0 115m 192.168.100.23 mgmt-1356446484-kubeadm-capm3-virt-management-cp-0 <none> <none>
kube-system kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-1 1/1 Running 0 51m 192.168.100.22 mgmt-1356446484-kubeadm-capm3-virt-management-cp-1 <none> <none>
kube-system kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 0/1 CrashLoopBackOff 14 (3m10s ago) 122m 192.168.100.20 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 <none> <none>
```
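To see why the containers keep dying, the logs of the last failed attempt can be pulled through the API server or, if that is unreliable, directly on the node with crictl (a sketch, using the pod names from this run):

```shell
# Logs from the previous (failed) attempt of the crash-looping static pods
kubectl -n kube-system logs etcd-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 --previous
kubectl -n kube-system logs kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 --previous

# Same thing from the node itself, bypassing the API server
crictl ps -a --name etcd
crictl logs <container-id>
```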
If we look at the journal on that node, we can observe the kubelet backing off the crashing static pods and containerd failing to port-forward to the etcd pod:
```
Jul 1 22:52:33 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 kubelet[1847]: E0701 22:52:33.025849 1847 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2_kube-system(a9d82f838364c64317775eb93db8d333)\"" pod="kube-system/kube-apiserver-mgmt-1356446484-kubeadm-capm3-virt-management-cp-2" podUID="a9d82f838364c64317775eb93db8d333"
Jul 1 22:52:35 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 containerd[1108]: time="2024-07-01T22:52:35.203853140Z" level=info msg="Portforward for \"661a9ba3e23c34703babcd9bf037d34a5acc52a6c38b62af7dab0b6d82b99289\" port []"
Jul 1 22:52:35 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 containerd[1108]: time="2024-07-01T22:52:35.203904749Z" level=info msg="Portforward for \"661a9ba3e23c34703babcd9bf037d34a5acc52a6c38b62af7dab0b6d82b99289\" returns URL \"http://127.0.0.1:38431/portforward/RiXFAOex\""
Jul 1 22:52:35 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 containerd[1108]: time="2024-07-01T22:52:35.206978785Z" level=info msg="Executing port forwarding in network namespace \"host\""
Jul 1 22:52:35 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 containerd[1108]: E0701 22:52:35.207753 1108 httpstream.go:290] error forwarding port 2379 to pod 661a9ba3e23c34703babcd9bf037d34a5acc52a6c38b62af7dab0b6d82b99289, uid : failed to execute portforward in network namespace "host": failed to connect to localhost:2379 inside namespace "661a9ba3e23c34703babcd9bf037d34a5acc52a6c38b62af7dab0b6d82b99289", IPv4: dial tcp4 127.0.0.1:2379: connect: connection refused IPv6 dial tcp6 [::1]:2379: connect: connection refused
Jul 1 22:52:35 mgmt-1356446484-kubeadm-capm3-virt-management-cp-2 systemd[1]: run-containerd-runc-k8s.io-953f1a817ed29b248c105e496fc0765d6629a8b5aef1d927927eefbf3e23176d-runc.PGzYAB.mount: Deactivated successfully.
```
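Since the port-forward to 2379 is refused, a first check on the node is whether etcd is listening at all and whether the local member answers (a sketch assuming the default kubeadm certificate paths under /etc/kubernetes/pki/etcd):

```shell
# Is anything listening on the etcd client port in the host network namespace?
ss -ltnp | grep 2379

# Query the local member directly, using the kubeadm-generated client certificate
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
```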
This issue is close to what has been described in these upstream bugs:
- https://github.com/kubernetes-sigs/cluster-api/issues/4253
- https://github.com/kubernetes/kubernetes/issues/99850
But in our case, it looks like containerd is attempting to perform the port-forward on both the IPv4 and the IPv6 loopback addresses.
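To confirm that the connection really is refused on both address families (and not, say, a TLS problem), the two dials from the containerd error can be replayed by hand on the node (a sketch; it assumes curl is available there):

```shell
# Replay containerd's tcp4 dial: "connection refused" matches the error above,
# while a TLS handshake failure would mean the port is actually reachable
curl -kv --connect-timeout 2 https://127.0.0.1:2379/health

# Replay the tcp6 dial against the IPv6 loopback (-g so the brackets are not globbed)
curl -gkv --connect-timeout 2 "https://[::1]:2379/health"
```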