Add Longhorn disk check before upgrade
During Longhorn upgrades to version 1.9, we observed intermittent failures in CI/CD pipelines. While some upgrade runs succeed, others fail with errors related to the Longhorn admission webhook, for example:
(InternalError): Internal error occurred: failed calling webhook "validator.longhorn.io": failed to call webhook: Post "https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validation?timeout=10s": no endpoints available for service "longhorn-admission-webhook"
2025/08/12 04:37:21.431101 Kustomization/flux-system state changed: Progressing - Reconciliation in progress
2025/08/12 04:37:22.417396 Kustomization/flux-system state changed: ReconciliationFailed - PersistentVolumeClaim/flux-system/flux-sources-pvc dry-run failed (InternalError): Internal error occurred: failed calling webhook "validator.longhorn.io": failed to call webhook: Post "https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/webhook/validation?timeout=10s": no endpoints available for service "longhorn-admission-webhook"
2025/08/12 04:37:54.263438 Command timeout exceeded
Timed-out waiting for the following resources to be ready:
IDENTIFIER STATUS REASON MESSAGE
HelmRelease/sylva-system/longhorn Failed Failed to upgrade after 4 attempt(s)
├┄╴DaemonSet/longhorn-system/longhorn-manager InProgress Available: 0/4
┆ ├┄╴Pod/longhorn-system/longhorn-manager-svg4r Failed Containers in CrashLoop state: longhorn-manager
Logs from longhorn-manager show that disks are still syncing:
Rejected operation: Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Node, namespace: longhorn-system, name: mgmt-1978883962-rke2-capm3-virt-management-cp-2, operation: UPDATE) error="spec and status of disks on node ... are being syncing and please retry later."
While disk syncing is in progress, the Longhorn admission webhook rejects changes to Longhorn custom resources that involve the affected nodes and disks.
This leads to transient errors in upgrade jobs: some nodes may appear missing, and upgrade attempts may fail.
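The kind of pre-upgrade check described here could be sketched roughly as follows. This is an illustration only, not the logic shipped in this change. It assumes the Longhorn `v1beta2` Node layout, where `spec.disks` and `status.diskStatus` are maps keyed by disk name and each disk status carries a `conditions` list with a `Ready` entry. The function names `unsynced_disks` and `find_unsynced_nodes` are hypothetical.

```python
from typing import Any, Dict, List


def unsynced_disks(node: Dict[str, Any]) -> List[str]:
    """Return the disks of one Longhorn Node object that look out of sync.

    Assumption: a disk counts as synced when it appears in both spec.disks
    and status.diskStatus and its status has a Ready condition set to "True".
    """
    spec_disks = (node.get("spec") or {}).get("disks") or {}
    disk_status = (node.get("status") or {}).get("diskStatus") or {}

    pending = []
    for name in sorted(set(spec_disks) | set(disk_status)):
        # Present on only one side: spec and status have not converged yet.
        if name not in spec_disks or name not in disk_status:
            pending.append(name)
            continue
        conditions = disk_status[name].get("conditions") or []
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(name)
    return pending


def find_unsynced_nodes(node_list: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map node name to its unsynced disks; an empty dict means all clear."""
    result = {}
    for node in node_list.get("items", []):
        pending = unsynced_disks(node)
        if pending:
            result[node["metadata"]["name"]] = pending
    return result
```

The input is expected to be the parsed JSON of `kubectl -n longhorn-system get nodes.longhorn.io -o json`. An upgrade job could call `find_unsynced_nodes` in a retry loop and start the upgrade only once it returns an empty dict.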