multiple systemd units stopped, breaking the node

This MR was initially titled "libvirt-metal - kube-proxy is not running on md-0 - (DNS resolution issue?)", but the root cause of the observed problem seems to be that many systemd units were stopped, completely breaking the node.

This very plausibly relates to sylva-projects/sylva-elements/diskimage-builder#158: @feleouet observed in the logs of this issue's job that the stopped units include the .mount units for the various mountpoints, so this issue would likely produce the symptoms observed in sylva-projects/sylva-elements/diskimage-builder#158.
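
If an affected node can still be reached, a quick way to confirm which units were stopped is to ask systemd directly. This is a minimal sketch (the commands are assumptions, not taken from the job logs):

    # list units that are no longer active, including the .mount units mentioned above
    systemctl list-units --all --state=inactive,failed
    # focus on the mount units specifically
    systemctl list-units --type=mount --all
    # look in the journal for what stopped them
    journalctl -b | grep -i 'stopp'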

Initial issue description

Job #11392807286 failed for 949a338a:

The initial deployment is stuck on the calico HelmRelease, but the issue does not seem related to calico itself: the machine itself is broken (same issue with kube-proxy).
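
The pod listing below shows that everything scheduled on the md-0 node is stuck in ContainerCreating while the pods on the control-plane nodes are Running. It looks like the output of the usual wide pod listing (the exact command used is an assumption):

    kubectl get pods -A -o wide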

kube-system       kube-proxy-2j74n                                                             1/1     Running             0          28m   192.168.100.22   mgmt-2045749394-kubeadm-capm3-virt-management-cp-0   <none>           <none>
kube-system       kube-proxy-h2cwg                                                             1/1     Running             0          33m   192.168.100.20   mgmt-2045749394-kubeadm-capm3-virt-management-cp-1   <none>           <none>
kube-system       kube-proxy-s7qh5                                                             0/1     ContainerCreating   0          29m   192.168.100.21   mgmt-2045749394-kubeadm-capm3-virt-management-md-0   <none>           <none>
kube-system       kube-proxy-sr45r                                                             1/1     Running             0          23m   192.168.100.23   mgmt-2045749394-kubeadm-capm3-virt-management-cp-2   <none>           <none>
kube-system       kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-0            1/1     Running             0          28m   192.168.100.22   mgmt-2045749394-kubeadm-capm3-virt-management-cp-0   <none>           <none>
kube-system       kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-1            1/1     Running             0          33m   192.168.100.20   mgmt-2045749394-kubeadm-capm3-virt-management-cp-1   <none>           <none>
kube-system       kube-scheduler-mgmt-2045749394-kubeadm-capm3-virt-management-cp-2            1/1     Running             0          23m   192.168.100.23   mgmt-2045749394-kubeadm-capm3-virt-management-cp-2   <none>           <none>
kube-system       kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-0                  1/1     Running             0          28m   192.168.100.22   mgmt-2045749394-kubeadm-capm3-virt-management-cp-0   <none>           <none>
kube-system       kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-1                  1/1     Running             0          33m   192.168.100.20   mgmt-2045749394-kubeadm-capm3-virt-management-cp-1   <none>           <none>
kube-system       kube-vip-mgmt-2045749394-kubeadm-capm3-virt-management-cp-2                  1/1     Running             0          22m   192.168.100.23   mgmt-2045749394-kubeadm-capm3-virt-management-cp-2   <none>           <none>
sylva-system      node-debug-mgmt-2045749394-kubeadm-capm3-virt-management-md-0                0/1     ContainerCreating   0          60s   192.168.100.21   mgmt-2045749394-kubeadm-capm3-virt-management-md-0   <none>           <none>
tigera-operator   tigera-operator-74b758446f-8fdhh                                             0/1     ContainerCreating   0          28m   192.168.100.21   mgmt-2045749394-kubeadm-capm3-virt-management-md-0   <none>           <none>

Tigera-operator event logs:

2025-09-17T20:11:26Z	2025-09-17T20:11:26Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:40845->127.0.0.53:53: read: connection refused"
2025-09-17T20:11:40Z	2025-09-17T20:11:40Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:60324->127.0.0.53:53: read: connection refused"
2025-09-17T20:11:55Z	2025-09-17T20:11:55Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:51207->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:09Z	2025-09-17T20:12:09Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:55991->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:20Z	2025-09-17T20:12:20Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:48222->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:33Z	2025-09-17T20:12:33Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:57079->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:47Z	2025-09-17T20:12:47Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:35915->127.0.0.53:53: read: connection refused"
2025-09-17T20:13:02Z	2025-09-17T20:13:02Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	tigera-operator-74b758446f-8fdhh	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:33284->127.0.0.53:53: read: connection refused"

Same for kube-proxy:

2025-09-17T20:11:56Z	2025-09-17T20:11:56Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	kube-proxy-s7qh5	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:37154->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:07Z	2025-09-17T20:12:07Z	kubelet-mgmt-2045749394-kubeadm-capm3-virt-management-md-0	Pod	kube-proxy-s7qh5	1	FailedCreatePodSandBox	"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:40628->127.0.0.53:53: read: connection refused"
2025-09-17T20:12:22Z	2025-09-17T20:12:44Z	-	Pod	kube-proxy-s7qh5	3	FailedCreatePodSandBox	"(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image ""registry.k8s.io/pause:3.8"": failed to pull image ""registry.k8s.io/pause:3.8"": failed to pull and unpack image ""registry.k8s.io/pause:3.8"": failed to resolve reference ""registry.k8s.io/pause:3.8"": failed to do request: Head ""https://registry.k8s.io/v2/pause/manifests/3.8"": dial tcp: lookup registry.k8s.io on 127.0.0.53:53: read udp 127.0.0.1:59369->127.0.0.53:53: read: connection refused"
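
In all these events the lookup fails against 127.0.0.53:53 with "connection refused". 127.0.0.53 is the systemd-resolved stub listener, so this is consistent with systemd-resolved being among the stopped units: nothing is answering on the local stub resolver anymore. A minimal sketch of how to verify this on the node (commands are assumptions, not taken from the job logs):

    systemctl status systemd-resolved
    resolvectl status          # fails if systemd-resolved is not running
    cat /etc/resolv.conf       # on Ubuntu this normally points to the 127.0.0.53 stub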

We can also see some issues in the cloud-init logs of the machine itself:

2025-09-17 20:10:14 ::  [    7.689677] cloud-init[1338]: >> Installing miniserve for log collection in CI
2025-09-17 20:10:14 ::  [    7.696532] cloud-init[1338]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2025-09-17 20:10:14 ::  [    7.697396] cloud-init[1338]:                                  Dload  Upload   Total   Spent    Left  Speed
2025-09-17 20:10:14 ::  [    7.698229] cloud-init[1338]: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:14 ::  [    7.699348] cloud-init[1338]: Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
2025-09-17 20:10:15 ::  [    8.698541] cloud-init[1338]: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:15 ::  [    8.699818] cloud-init[1338]: Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
2025-09-17 20:10:17 ::  [   10.701141] cloud-init[1338]: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:17 ::  [   10.701401] cloud-init[1338]: Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
2025-09-17 20:10:18 ::  
2025-09-17 20:10:18 ::  Ubuntu 24.04.2 LTS mgmt-2045749394-kubeadm-capm3-virt-management-md-0 ttyS0
2025-09-17 20:10:18 ::  
2025-09-17 20:10:21 ::  mgmt-2045749394-kubeadm-capm3-virt-management-md-0 login: [   14.706423] cloud-init[1338]: 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: github.com
2025-09-17 20:10:21 ::  [   14.706905] cloud-init[1338]: /var/lib/cloud/instance/scripts/runcmd: 77: wget: not found
2025-09-17 20:10:21 ::  [   14.707823] cloud-init[1338]: chmod: cannot access '/usr/local/bin/miniserve': No such file or directory
2025-09-17 20:10:21 ::  [   14.709408] cloud-init[1338]: link /var/log into /opt/dump
2025-09-17 20:10:21 ::  [   14.712271] cloud-init[1338]: ln -sf /var/log /opt/dump/system--var-log
2025-09-17 20:10:21 ::  [   14.715172] cloud-init[1338]: /var/lib/cloud/instance/scripts/runcmd: 93: [[: not found

^ This might be a false lead
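
For reference, the last two script errors look like side issues rather than the root cause: "wget: not found" simply means wget is not installed in the image (the miniserve download had already failed on DNS anyway), and "[[: not found" means the runcmd script is being run by /bin/sh (dash on Ubuntu), which does not support the bash-only [[ test. A portable form of such a test would look like this (hypothetical snippet, the actual runcmd content is not shown here):

    # bash-only, fails under dash:
    #   if [[ -x /usr/local/bin/miniserve ]]; then ...
    # POSIX sh equivalent:
    if [ -x /usr/local/bin/miniserve ]; then
        echo "miniserve is installed"
    fi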

Investigation is needed

We have already seen the same issue on https://gitlab.com/sylva-projects/sylva-core/-/jobs/11378340997

Priority:medium because it seems to be related only to "libvirt-metal", but not lower because it still breaks certain CI jobs.
