multiple systemd units stopped, breaking the node
This issue was initially titled "libvirt-metal - kube-proxy is not running on md-0 - (DNS resolution issue?)", but the problem observed seems to have as its root cause the fact that many systemd units were stopped, completely breaking the node.
This very plausibly relates to sylva-projects/sylva-elements/diskimage-builder#158 (@feleouet observed in the logs of the job for this issue that the stopped units include the .mount units for the various mountpoints, so this issue would likely produce the symptoms observed in sylva-projects/sylva-elements/diskimage-builder#158).
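For reference, a minimal way to confirm this state on an affected node (assuming console or SSH access is still possible; the exact commands below are only a diagnostic sketch, not something taken from the job) is to list the units that are no longer active and look in the journal for the moment they were stopped:

```shell
# List units that are no longer active (in the job logs referenced above,
# the .mount units for the node's mountpoints show up here)
systemctl list-units --all --state=inactive,failed --no-pager

# Check whether the DNS stub resolver and the kubelet are among the stopped units
systemctl status systemd-resolved kubelet --no-pager

# Find the point in time where systemd started stopping units
journalctl -b --no-pager | grep -iE 'stopp(ed|ing)' | head -n 50
```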
Initial issue description
Job #11392807286 failed for 949a338a:
The initial deployment is stuck on the calico HelmRelease, but the issue seems related not to calico itself but to the machine itself (the same issue affects kube-proxy).
kube-system kube-proxy-2j74n 1/1 Running 0 28m 192.168.100.22 mgmt-2045749394-kubeadm-capm3-virt-management-cp-0 <none> <none>
kube-system kube-proxy-h2cwg 1/1 Running 0 33m 192.168.100.20 mgmt-2045749394-kubeadm-capm3-virt-management-cp-1 <none> <none>
kube-system kube-proxy-s7qh5 0/1 ContainerCreating 0 29m 192.168.100.21 mgmt-2045749394-kubeadm-capm3-virt-management-md-0 <none> <none>
kube-system kube-proxy-sr45r 1/1 Running 0 23m 192.168.100.23 mgmt-2045749394-kubeadm-capm3-virt-management-cp-2 <none> <none>
kube-system kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-0 1/1 Running 0 28m 192.168.100.22 mgmt-2045749394-kubeadm-capm3-virt-management-cp-0 <none> <none>
kube-system kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-1 1/1 Running 0 33m 192.168.100.20 mgmt-2045749394-kubeadm-capm3-virt-management-cp-1 <none> <none>
kube-system kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-2 1/1 Running 0 23m 192.168.100.23 mgmt-2045749394-kubeadm-capm3-virt-management-cp-2 <none> <none>
kube-system kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-0 1/1 Running 0 28m 192.168.100.22 mgmt-2045749394-kubeadm-capm3-virt-management-cp-0 <none> <none>
kube-system kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-1 1/1 Running 0 33m 192.168.100.20 mgmt-2045749394-kubeadm-capm3-virt-management-cp-1 <none> <none>
kube-system kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-2 1/1 Running 0 22m 192.168.100.23 mgmt-2045749394-kubeadm-capm3-virt-management-cp-2 <none> <none>
sylva-system node-debug-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 0/1 ContainerCreating 0 60s 192.168.100.21 mgmt-2045749394-kubeadm-capm3-virt-management-md-0 <none> <none>
tigera-operator tigera-operator-74b758446f-8fdhh 0/1 ContainerCreating 0 28m 192.168.100.21 mgmt-2045749394-kubeadm-capm3-virt-management-md-0 <none> <none>
Tigera-operator events.logs
2025-09-17T20:11:26Z 2025-09-17T20:11:26Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:40845->127.0.0.53:53: read: connection refused"
2025-09-17T20:11:40Z 2025-09-17T20:11:40Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:60324->127.0.0.53:53: read: connection refused"
2025-09-17T20:11:55Z 2025-09-17T20:11:55Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:51207->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:09Z 2025-09-17T20:12:09Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:55991->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:20Z 2025-09-17T20:12:20Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:48222->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:33Z 2025-09-17T20:12:33Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:57079->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:47Z 2025-09-17T20:12:47Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:35915->127.0.0.53:53: read: connection refused"
2025-09-17T20:13:02Z 2025-09-17T20:13:02Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod tigera-operator-74b758446f-8fdhh 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:33284->127.0.0.53:53: read: connection refused"
Same for kube-proxy:
2025-09-17T20:11:56Z 2025-09-17T20:11:56Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod kube-proxy-s7qh5 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:37154->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:07Z 2025-09-17T20:12:07Z kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0 Pod kube-proxy-s7qh5 1 FailedCreatePodSandBox "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:40628->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:22Z 2025-09-17T20:12:44Z - Pod kube-proxy-s7qh5 3 FailedCreatePodSandBox "(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:59369->127.0.0.53:53: read: connection refused"
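The "connection refused" on 127.0.0.53:53 is consistent with the root cause above: 127.0.0.53 is the listen address of the systemd-resolved stub resolver, so if systemd-resolved is among the stopped units, nothing answers DNS queries on the node and containerd cannot resolve registry.k8s.io. A quick way to confirm this from the node (again only a sketch, assuming access to the machine):

```shell
# "connection refused" on 127.0.0.53:53 means nothing is listening on the
# systemd-resolved stub address, i.e. systemd-resolved is not running
systemctl is-active systemd-resolved

# Check what /etc/resolv.conf actually points to on this node
cat /etc/resolv.conf

# Once systemd-resolved is running again, resolution should work
resolvectl query registry.k8s.io
```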
We can also see some issues in the cloud-init output of the machine itself:
2025-09-17 20:10:14 :: [ 7.689677] cloud-init[1338]: >> Installing miniserve for log collection in CI
2025-09-17 20:10:14 :: [ 7.696532] cloud-init[1338]: % Total % Received % Xferd Average Speed Time Time Time Current
2025-09-17 20:10:14 :: [ 7.697396] cloud-init[1338]: Dload Upload Total Spent Left Speed
2025-09-17 20:10:14 :: [ 7.698229] cloud-init[1338]:
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:14 :: [ 7.699348] cloud-init[1338]: Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
2025-09-17 20:10:15 :: [ 8.698541] cloud-init[1338]:
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:15 :: [ 8.699818] cloud-init[1338]: Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
2025-09-17 20:10:17 :: [ 10.701141] cloud-init[1338]:
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:17 :: [ 10.701401] cloud-init[1338]: Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
2025-09-17 20:10:18 ::
2025-09-17 20:10:18 :: Ubuntu 24.04.2 LTS mgmt-2045749394-kubeadm-capm3-virt-management-md-0 ttyS0
2025-09-17 20:10:18 ::
2025-09-17 20:10:21 :: mgmt-2045749394-kubeadm-capm3-virt-management-md-0 login: [ 14.706423] cloud-init[1338]:
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:21 :: [ 14.706905] cloud-init[1338]: /var/lib/cloud/instance/scripts/runcmd: 77: wget: not found
2025-09-17 20:10:21 :: [ 14.707823] cloud-init[1338]: chmod: cannot access '/usr/local/bin/miniserve': No such file or directory
2025-09-17 20:10:21 :: [ 14.709408] cloud-init[1338]: link /var/log into /opt/dump
2025-09-17 20:10:21 :: [ 14.712271] cloud-init[1338]: ln -sf /var/log /opt/dump/system--var-log
2025-09-17 20:10:21 :: [ 14.715172] cloud-init[1338]: /var/lib/cloud/instance/scripts/runcmd: 93: [[: not found
^ This might be a false lead
Investigation is needed
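That said, two separate things show up in the cloud-init output above: the curl failures are the same DNS symptom (no resolver available), while the "wget: not found" and "[[: not found" lines suggest the runcmd script falls back to wget without it being installed and uses the bash-only "[[" test while being executed by /bin/sh (dash). The actual content of /var/lib/cloud/instance/scripts/runcmd is not shown here, so the following is only a hypothetical sketch of a POSIX-compatible variant of that part of the script (the MINISERVE_URL variable is made up for the example):

```shell
# Prefer curl, and only fall back to wget if it is actually installed
if command -v curl >/dev/null 2>&1; then
    curl -fsSL --retry 3 -o /usr/local/bin/miniserve "$MINISERVE_URL"
elif command -v wget >/dev/null 2>&1; then
    wget -O /usr/local/bin/miniserve "$MINISERVE_URL"
fi

# POSIX "[ ... ]" test instead of the bash-only "[[ ... ]]"
[ -f /usr/local/bin/miniserve ] && chmod +x /usr/local/bin/miniserve
```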
We have already seen the same issue on https://gitlab.com/sylva-projects/sylva-core/-/jobs/11378340997.
Priority:medium: it seems to be related only to "libvirt-metal", but it still breaks certain CI jobs.