capm3 management cluster rolling update fails because of pods with 'FailedAttachVolume'

Job #6597596587 failed for bb6c373d:

artifacts.zip

This issue seems to have been occurring for a few days: image

and it is quite similar: the rolling update is blocked in the middle because a node cannot be drained:

NAME                                              STATUS                     ROLES                       AGE   VERSION          INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
mgmt-1248282387-rke2-capm3-virt-management-cp-0   Ready                      control-plane,etcd,master   79m   v1.28.8+rke2r1   192.168.100.22   <none>        Ubuntu 22.04.4 LTS   5.15.0-101-generic   containerd://1.7.11-k3s2
mgmt-1248282387-rke2-capm3-virt-management-cp-1   Ready                      control-plane,etcd,master   28m   v1.28.8+rke2r1   192.168.100.23   <none>        Ubuntu 22.04.4 LTS   5.15.0-101-generic   containerd://1.7.11-k3s2
mgmt-1248282387-rke2-capm3-virt-management-cp-2   Ready,SchedulingDisabled   control-plane,etcd,master   84m   v1.28.8+rke2r1   192.168.100.20   <none>        Ubuntu 22.04.4 LTS   5.15.0-101-generic   containerd://1.7.11-k3s2
mgmt-1248282387-rke2-capm3-virt-management-md-0   Ready                      <none>                      31m   v1.28.8+rke2r1   192.168.100.21   <none>        Ubuntu 22.04.4 LTS   5.15.0-101-generic   containerd://1.7.11-k3s2

It's mainly the Vault and/or Keycloak Postgres statefulset pods which cannot be deleted from the drained node (mgmt-1248282387-rke2-capm3-virt-management-cp-2), because a pod on another node fails to initialise. That other node always seems to be the worker node (mgmt-1248282387-rke2-capm3-virt-management-md-0).

In the keycloak namespace we can see:

keycloak                           postgres-read-0                                                            1/1     Running             0             75m     100.72.199.155   mgmt-1248282387-rke2-capm3-virt-management-cp-2   <none>           <none>
keycloak                           postgres-read-1                                                            1/1     Running             0             74m     100.72.176.220   mgmt-1248282387-rke2-capm3-virt-management-cp-0   <none>           <none>
keycloak                           postgres-read-2                                                            0/1     ContainerCreating   0             39m     <none>           mgmt-1248282387-rke2-capm3-virt-management-md-0   <none>           <none>

with events:

2024-04-10T21:52:09Z	2024-04-10T21:52:09Z	Pod	postgres-read-2	1	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-2079a1cd-ef76-42dc-85e5-65e36d94749a"" : timed out waiting for external-attacher of driver.longhorn.io CSI driver to attach volume pvc-2079a1cd-ef76-42dc-85e5-65e36d94749a"
2024-04-10T21:52:40Z	2024-04-10T22:15:17Z	Pod	postgres-read-2	18	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-2079a1cd-ef76-42dc-85e5-65e36d94749a"" : rpc error: code = DeadlineExceeded desc = volume pvc-2079a1cd-ef76-42dc-85e5-65e36d94749a failed to attach to node mgmt-1248282387-rke2-capm3-virt-management-md-0 with attachmentID csi-4cc9b551c3f8518b4aa195b73d7e62a9b1265542cc562d3171dee6138151895f"

The same happens for vault:

vault                              vault-0                                                                    0/2     Init:0/1            0             38m     <none>           mgmt-1248282387-rke2-capm3-virt-management-md-0   <none>           <none>
vault                              vault-1                                                                    2/2     Running             0             73m     100.72.176.221   mgmt-1248282387-rke2-capm3-virt-management-cp-0   <none>           <none>
vault                              vault-2                                                                    2/2     Running             0             39m     100.72.199.196   mgmt-1248282387-rke2-capm3-virt-management-cp-2   <none>           <none>

with events:

2024-04-10T21:52:09Z	2024-04-10T21:52:09Z	Pod	vault-0	1	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-9aaf2737-21e1-4f41-88a8-b08a43c61e5c"" : timed out waiting for external-attacher of driver.longhorn.io CSI driver to attach volume pvc-9aaf2737-21e1-4f41-88a8-b08a43c61e5c"
2024-04-10T21:52:40Z	2024-04-10T22:17:19Z	Pod	vault-0	19	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-9aaf2737-21e1-4f41-88a8-b08a43c61e5c"" : rpc error: code = DeadlineExceeded desc = volume pvc-9aaf2737-21e1-4f41-88a8-b08a43c61e5c failed to attach to node mgmt-1248282387-rke2-capm3-virt-management-md-0 with attachmentID csi-e1cb47c3bb6c5bf7eea011f0566748db425fc29a1fa527dfae02391433669a5e"
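
To compare failing runs quickly, the event dumps above can be reduced to one line per pod and volume. The following is only a rough sketch, not part of the CI tooling: it assumes the events are tab-separated with CSV-style quoting and the column order seen above (first seen, last seen, kind, name, count, reason, message). The function name `summarize_attach_failures` and the field names are made up for illustration.

```python
import csv
import io
import re

# Column layout assumed from the event dumps in this issue (hypothetical names).
FIELDS = ["first_seen", "last_seen", "kind", "name", "count", "reason", "message"]

# Patterns matched against the FailedAttachVolume messages quoted above.
VOLUME_RE = re.compile(r'for volume "(pvc-[0-9a-f-]+)"')
NODE_RE = re.compile(r"failed to attach to node (\S+)")


def summarize_attach_failures(text):
    """Group FailedAttachVolume events by (pod, volume).

    Returns {(pod, volume): {"count": total event count, "node": target node or None}}.
    """
    summary = {}
    # The message column is quoted and uses doubled quotes, which csv handles.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank lines or lines with a different layout
        event = dict(zip(FIELDS, row))
        if event["reason"] != "FailedAttachVolume":
            continue
        volume = VOLUME_RE.search(event["message"])
        if volume is None:
            continue
        key = (event["name"], volume.group(1))
        entry = summary.setdefault(key, {"count": 0, "node": None})
        entry["count"] += int(event["count"])
        # Only the "rpc error" variant of the message names the target node.
        node = NODE_RE.search(event["message"])
        if node:
            entry["node"] = node.group(1)
    return summary
```

Fed the two vault-0 events above, this should yield a single entry for pvc-9aaf2737-21e1-4f41-88a8-b08a43c61e5c pointing at the md-0 node, which makes it easy to check whether the worker node is the attach target in every failed job.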