longhorn: Engine stuck in "stopped" state, preventing volume attach and blocking drain
Since the upgrade to Longhorn 1.7, we have observed cases in CI pipelines where Longhorn Engines/Replicas seem to be stuck in the stopped state.
When this happens, some pods cannot start, and for units that have PodDisruptionBudgets this results in node drains failing.
job 8559500297
Example in run https://gitlab.com/sylva-projects/sylva-core/-/jobs/8559500297:
- Machine/mgmt-1575464159-rke2-capm3-virt-control-plane-jsdhkin in Deleting state, DrainingSucceeded remains False (Node mgmt-1575464159-rke2-capm3-virt-management-cp-2)
- the pod that can't be evicted is monitoring-pool-0-2
- this is because pod monitoring-pool-0-1 can't start on mgmt-1575464159-rke2-capm3-virt-management-md-0
- which is because:
2024-12-05T18:00:39Z 2024-12-05T19:40:20Z attachdetach-controller- Pod monitoring-pool-0-1 51 FailedAttachVolume "AttachVolume.Attach failed for volume ""pvc-9d02c710-a934-40e1-88aa-59997fadc1e9"" : rpc error: code = Aborted desc = volume pvc-9d02c710-a934-40e1-88aa-59997fadc1e9 is not ready for workloads"
- which seems to be because the Engine for this volume is stopped:
$ grep 9d02c710 Engines.longhorn.io.summary.txt
NAMESPACE NAME DATA ENGINE STATE ...
longhorn-system pvc-9d02c710-a934-40e1-88aa-59997fadc1e9-e-0 v1 stopped ...
(In this job, two Engines were stopped; the other one was the Engine for the thanos-compactor PVC.)
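To list every stopped Engine in a summary dump instead of grepping for one volume ID, something like the following could be used. It is only a sketch: it assumes the column layout shown in the excerpt above (namespace, name, data engine, state as the first four whitespace-separated fields), and the function name and the second sample row are made up for illustration.

```python
def stopped_engines(summary_text):
    """Return (namespace, engine_name) pairs whose state field reads 'stopped'.

    Assumes the first four whitespace-separated fields of each data row are
    namespace, name, data engine and state, as in the excerpt above.
    """
    result = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _data_engine, state = fields[:4]
        if state == "stopped":
            result.append((namespace, name))
    return result


# First data row is taken from the excerpt above; the second one is hypothetical.
sample = """\
NAMESPACE NAME DATA ENGINE STATE
longhorn-system pvc-9d02c710-a934-40e1-88aa-59997fadc1e9-e-0 v1 stopped
longhorn-system pvc-00000000-0000-0000-0000-000000000000-e-0 v1 running
"""

for namespace, name in stopped_engines(sample):
    print(namespace, name)
```

Reading the real file would be `stopped_engines(open("Engines.longhorn.io.summary.txt").read())`, assuming the file is in the current directory.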
job 8578325612
In this run, the drain of Node mgmt-1578613858-kubeadm-capm3-virt-management-cp-1 was blocked.
On this node, the pods whose PDBs had zero "allowed disruptions" had sibling pods with the following errors:
- harbor-postgres-read pods:
2024-12-07T23:00:03Z 2024-12-08T00:42:12Z attachdetach-controller- Pod harbor-postgres-read-0 41 FailedAttachVolume "AttachVolume.Attach failed for volume ""pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117 to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117?action=attach]"
- keycloak/postgres-primary-0
2024-12-08T00:42:56Z 2024-12-08T00:42:56Z attachdetach-controller- Pod postgres-primary-0 1 FailedAttachVolume "AttachVolume.Attach failed for volume ""pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [detail=, message=unable to attach volume pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found, code=Not Found] from [http://longhorn-backend:9500/v1/volumes/pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c?action=attach]"
- minio-monitoring
2024-12-07T23:28:39Z 2024-12-08T00:42:11Z attachdetach-controller- Pod monitoring-pool-0-1 26 FailedAttachVolume "AttachVolume.Attach failed for volume ""pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4 to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4?action=attach]"
- thanos/ruler
2024-12-07T23:35:25Z 2024-12-08T00:44:53Z attachdetach-controller- Pod thanos-ruler-1 22 FailedAttachVolume "AttachVolume.Attach failed for volume ""pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b?action=attach]"
In this run, a total of 15 Engines are stopped (including the 4 for the PVCs mentioned above).
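The two jobs show different attach errors: "is not ready for workloads" in job 8559500297, and node.longhorn.io "not found" in job 8578325612. A small sketch like the one below could help sort FailedAttachVolume event lines by error type and pull out the volume name for cross-checking against the stopped Engines. The function name and the reason labels are hypothetical, and the sample lines are shortened from the events quoted above.

```python
import re

# Matches the volume name in the event message; the quotes around the name
# may be doubled ("") when the events come from a CSV-style export.
VOLUME_RE = re.compile(r'AttachVolume\.Attach failed for volume "+(pvc-[0-9a-f-]+)"+')


def classify_attach_failure(event_line):
    """Return (volume, reason) for a FailedAttachVolume line, or None if no match."""
    match = VOLUME_RE.search(event_line)
    if match is None:
        return None
    volume = match.group(1)
    if "is not ready for workloads" in event_line:
        reason = "volume-not-ready"
    elif "node.longhorn.io" in event_line and "not found" in event_line:
        reason = "longhorn-node-not-found"
    else:
        reason = "other"
    return volume, reason


# Shortened copies of two event messages quoted in this issue.
not_ready = (
    'AttachVolume.Attach failed for volume ""pvc-9d02c710-a934-40e1-88aa-59997fadc1e9"" : '
    "rpc error: code = Aborted desc = volume pvc-9d02c710-a934-40e1-88aa-59997fadc1e9 "
    "is not ready for workloads"
)
node_missing = (
    'AttachVolume.Attach failed for volume ""pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117"" : '
    "rpc error: code = Internal desc = Bad response statusCode [404]. "
    'node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found]'
)

print(classify_attach_failure(not_ready))
print(classify_attach_failure(node_missing))
```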