longhorn: Engines stuck in "stopped" state prevent volume attach, blocking node drain

Since the upgrade to Longhorn 1.7, we have observed cases in CI pipelines where Longhorn Engines/Replicas seem to be stuck in the "stopped" state.

When this happens, some pods cannot start, and for units that have PodDisruptionBudgets this results in node drains failing.

job 8559500297

Example in run https://gitlab.com/sylva-projects/sylva-core/-/jobs/8559500297:

Machine/mgmt-1575464159-rke2-capm3-virt-control-plane-jsdhk in Deleting state, DrainingSucceeded remains False (Node mgmt-1575464159-rke2-capm3-virt-management-cp-2)
the pod that can't be evicted is monitoring-pool-0-2
this is because pod monitoring-pool-0-1 can't start on mgmt-1575464159-rke2-capm3-virt-management-md-0
which is because:

2024-12-05T18:00:39Z	2024-12-05T19:40:20Z	attachdetach-controller-	Pod	monitoring-pool-0-1	51	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-9d02c710-a934-40e1-88aa-59997fadc1e9"" : rpc error: code = Aborted desc = volume pvc-9d02c710-a934-40e1-88aa-59997fadc1e9 is not ready for workloads"

which seems to be because the Engine for this volume is stopped:

$ grep 9d02c710 Engines.longhorn.io.summary.txt
NAMESPACE         NAME                                           DATA ENGINE   STATE     ...
longhorn-system   pvc-9d02c710-a934-40e1-88aa-59997fadc1e9-e-0   v1            stopped   ...

(in this job, two Engines were stopped; the other one was for the thanos-compactor PVC)
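To list every stopped Engine in a dump instead of grepping for one volume ID at a time, something like the following could be used. This is only a sketch: it assumes the summary file keeps the column layout shown above, and the function name `list_stopped_engines` is made up for illustration.

```shell
#!/bin/sh
# Print the names of all Engines whose STATE is "stopped".
# Assumption: data rows have the whitespace-separated fields
#   NAMESPACE, NAME, DATA ENGINE (e.g. "v1"), STATE, ...
# so STATE is field 4. (The header line has one extra word because
# "DATA ENGINE" contains a space, hence the header is skipped.)
list_stopped_engines() {
  awk 'NR > 1 && $4 == "stopped" { print $2 }' "$1"
}

# Usage (file name as in the grep above):
#   list_stopped_engines Engines.longhorn.io.summary.txt
#   list_stopped_engines Engines.longhorn.io.summary.txt | wc -l
```

Piping the result through `wc -l` gives the number of stopped Engines for a run.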

job 8578325612

In this run, the drain of Node mgmt-1578613858-kubeadm-capm3-virt-management-cp-1 was blocked.

On this node, the pods whose PDBs had zero "allowed disruptions" had sibling pods with the following errors:

harbor-postgres-read pods:

2024-12-07T23:00:03Z	2024-12-08T00:42:12Z	attachdetach-controller-	Pod	harbor-postgres-read-0	41	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117 to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-ba3f0ba7-8e04-4d68-bafb-7d23981c6117?action=attach]"

keycloak/postgres-primary-0

2024-12-08T00:42:56Z	2024-12-08T00:42:56Z	attachdetach-controller-	Pod	postgres-primary-0	1	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [detail=, message=unable to attach volume pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found, code=Not Found] from [http://longhorn-backend:9500/v1/volumes/pvc-dec43e40-1ae0-4d9e-a4cf-d9bc8c7e795c?action=attach]"

minio-monitoring

2024-12-07T23:28:39Z	2024-12-08T00:42:11Z	attachdetach-controller-	Pod	monitoring-pool-0-1	26	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4 to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-d6db9a22-ee71-4191-ac50-e5640d9e4ac4?action=attach]"

thanos/ruler

2024-12-07T23:35:25Z	2024-12-08T00:44:53Z	attachdetach-controller-	Pod	thanos-ruler-1	22	FailedAttachVolume	"AttachVolume.Attach failed for volume ""pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b"" : rpc error: code = Internal desc = Bad response statusCode [404]. Status [404 Not Found]. Body: [code=Not Found, detail=, message=unable to attach volume pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b to mgmt-1578613858-kubeadm-capm3-virt-management-md-0: node.longhorn.io ""mgmt-1578613858-kubeadm-capm3-virt-management-md-0"" not found] from [http://longhorn-backend:9500/v1/volumes/pvc-456e5e1d-0eba-44cd-ba16-95f4647dfa7b?action=attach]"

In this run, a total of 15 Engines were stopped (including the 4 for the PVCs mentioned above).
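The cross-check between FailedAttachVolume events and Engine state can be scripted along these lines. This is a sketch under several assumptions: the events are available in a text file in the same format as the lines quoted above (the file name `events.txt` is hypothetical), Engine names are the volume name followed by `-e-` and a suffix (as in `pvc-9d02c710-a934-40e1-88aa-59997fadc1e9-e-0`), and the Engine summary has the column layout shown earlier. The function names are made up for illustration.

```shell
#!/bin/sh
# Extract the unique volume names from FailedAttachVolume event lines.
# Relies on the message format quoted in this issue:
#   AttachVolume.Attach failed for volume ""pvc-<uuid>"" : ...
failed_attach_volumes() {
  sed -nE 's/.*FailedAttachVolume.*Attach failed for volume "+(pvc-[0-9a-f-]+)"+ .*/\1/p' "$1" | sort -u
}

# For each such volume, print "<volume> <engine state>".
# $1: events file, $2: Engine summary file.
# Assumption: the Engine NAME (field 2) starts with "<volume>-e-" and the
# STATE is field 4; volumes with no matching Engine are reported as "unknown".
failed_attach_engine_states() {
  failed_attach_volumes "$1" | while read -r vol; do
    awk -v vol="$vol" '
      index($2, vol "-e-") == 1 { print vol, $4; found = 1 }
      END { if (!found) print vol, "unknown" }
    ' "$2"
  done
}

# Usage (events.txt is a hypothetical file name):
#   failed_attach_engine_states events.txt Engines.longhorn.io.summary.txt
```

A volume reported with state `stopped` matches the pattern described in this issue; `unknown` only means no Engine with the assumed name prefix was found in the summary.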

Edited Dec 09, 2024 by Thomas Morin