RKE2 failure to instantiate a node (growpart failure, can't find /tmp)

We have many pipelines where the symptom is nodes running out of disk and evicting pods, and where the root cause is the sylva-growparts systemd unit script failing to grow the partition.

# journalctl -xeu sylva-growparts | cat
...
Oct 02 06:47:53 ubuntu sylva-growparts[659]: /usr/bin/growpart: 714: cannot create /tmp/growpart.zAuylS/pt_update.err: Directory nonexistent
Oct 02 06:47:53 ubuntu sylva-growparts[659]: failed [pt_update:2] pt_update /dev/sda 3
Oct 02 06:47:53 ubuntu sylva-growparts[839]: cat: /tmp/growpart.zAuylS/pt_update.err: No such file or directory
Oct 02 06:47:53 ubuntu sylva-growparts[659]: FAILED: pt_resize failed
...
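
When the unit fails this way, the partition is left at its original image size and the root filesystem quickly fills up. On an affected node this can be confirmed with standard commands along these lines (illustrative only; the device and partition number, /dev/sda and partition 3 here, are taken from the log above and may differ):

# lsblk /dev/sda
# df -h /
# systemctl status sylva-growparts --no-pager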

It is not entirely clear at this point why it impacts some pipelines and not others, but this issue has been frequently observed on "capo misc rke2 ubuntu" jobs. One reason it is only seen on RKE2 pipelines is that the kubeadm runs use a non-hardened OS whose partitioning is different/simpler and does not rely on sylva-growparts.

The root cause is very plausibly a missing systemd ordering dependency: the /tmp directory is not yet available when the sylva-growparts service is run.

It is probably strongly related to sylva-projects/sylva-elements/diskimage-builder#163: the fix for that issue, brought in sylva-projects/sylva-elements/diskimage-builder!518 (merged), seems to have introduced this regression.
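
If the ordering diagnosis is right, one possible direction, sketched here purely as an illustration (the drop-in path and directives are assumptions, not the actual fix to be done in diskimage-builder), would be to explicitly order the unit after /tmp is set up, for instance with a drop-in:

(/etc/systemd/system/sylva-growparts.service.d/10-wait-for-tmp.conf)

[Unit]
# assuming /tmp is a tmpfs handled by tmp.mount on this image
After=local-fs.target tmp.mount
Wants=tmp.mount

If /tmp is instead created by systemd-tmpfiles on this image, ordering after systemd-tmpfiles-setup.service would be the equivalent.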

Initial issue description

In two MR pipelines:

  • https://gitlab.com/sylva-projects/sylva-core/-/jobs/11560324311 (rke2 capo-misc fresh-install on the MR dev branch for main)
  • https://gitlab.com/sylva-projects/sylva-core/-/jobs/11560324445 (capo-misc, upgrade from 1.5.x)

... the deploy-mgmt-cluster job fails, stuck at the instantiation of the first node.

(Since this happens also on the upgrade-from-1.5.x job, we know that the code of the MR is not the cause.)

Here is what happens...

The node does not come up because cloud-init is looping, waiting for the metallb Helm release to be installed:

(cloud-init-output.log)

...
Waiting for Metallb to be ready...
Error from server (NotFound): services "metallb-webhook-service" not found
Error from server (NotFound): services "metallb-webhook-service" not found
Error from server (NotFound): services "metallb-webhook-service" not found
...
(looping on this)
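
The wait logic in the user-data presumably looks something like the loop below; this is a hypothetical reconstruction based only on the messages above (in particular, the metallb-system namespace is assumed), not the actual Sylva script:

echo "Waiting for Metallb to be ready..."
until kubectl -n metallb-system get service metallb-webhook-service; do
  sleep 5
done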

And I see this surprising error in the kubelet logs:

rke2---var-lib-rancher-rke2-agent-logs/kubelet.log:818:E1001 09:58:34.472388    1754 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"helm\" with ImagePullBackOff: \"Back-off pulling image \\\"rancher/klipper-helm:v0.9.8-build20250709\\\":
 ErrImagePull: failed to pull and unpack image \\\"docker.io/rancher/klipper-helm:v0.9.8-build20250709\\\":
 failed to extract layer (application/vnd.oci.image.layer.v1.tar+gzip sha256:fe07684b16b82247c3539ed86a65ff37a76138ec25d380bd80c869a1a4c73236) to overlayfs as \\\"extract-907273175-Yxhx sha256:fd2758d7a50e2b78d275ee7d1c218489f2439084449d895fa17eede6c61ab2c4\\\":
 mount callback failed on /var/lib/rancher/rke2/agent/containerd/tmpmounts/containerd-mount2227071047:
 write /var/lib/rancher/rke2/agent/containerd/tmpmounts/containerd-mount2227071047/usr/lib/libapk.so.2.14.9:
 no space left on device\""
 pod="metallb-system/helm-install-metallb-ctrsf" podUID="30ed2309-b0b0-45f0-b59b-bb270fe3b6ec"

Summary: there is no space left to extract the rancher/klipper-helm image used by the pod instantiated to install the metallb Helm release.

(all these dumps were taken from management-cluster-dump/node_logs)
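
For completeness, the same conclusion can be cross-checked from the pod events and from the node's filesystem; illustrative commands only (the pod name is taken from the kubelet log above):

# kubectl -n metallb-system describe pod helm-install-metallb-ctrsf

and, on the node itself:

# df -h /var/lib/rancher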
