Add longhorn-pre-disk-check unit

What does this MR do and why?

Closes #2988

This merge request introduces a pre-disk-check unit to ensure that all Longhorn nodes and disks are fully registered, ready, and schedulable before the Longhorn upgrade begins.

During Longhorn 1.9 upgrades, we observed intermittent upgrade failures caused by the Longhorn admission webhook rejecting CRD updates while disks were still syncing.

Typical error observed:

```
InternalError: failed calling webhook "validator.longhorn.io": no endpoints available for service "longhorn-admission-webhook"
```

Further inspection of the longhorn-manager logs showed errors like:

```
Rejected operation ... error="spec and status of disks on node <node-name> are being syncing and please retry later."
```

This indicates that the upgrade was being triggered while Longhorn nodes were still initializing or syncing disks, leading to transient webhook failures and CrashLoopBackOff states.

Changes Introduced:

Added a pre-disk-check script/unit that runs before initiating the Longhorn upgrade process.

The script performs:

1. Node registration check:

Waits for all Kubernetes nodes annotated with a Longhorn default-disk configuration to appear in Longhorn’s Node CR list (nodes.longhorn.io).

2. Disk readiness check:

Ensures every disk in each Longhorn node has:

  • Ready=True
  • Schedulable=True

3. Acknowledgment of admin-controlled scheduling:

If an administrator has intentionally set allowScheduling: false at the node or disk level, the script detects it and skips readiness validation for that resource.

This avoids blocking upgrades due to operational or maintenance-driven scheduling configurations.

4. Periodic retries with status summary:

Continuously checks and prints disk states until all are ready.

Provides a final disk summary for debugging if readiness is not achieved.
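The four checks above can be sketched as follows. This is a minimal, illustrative Python sketch, not the actual MR code: the field names follow the public `nodes.longhorn.io` CR schema (`spec.disks[*].allowScheduling`, `status.diskStatus[*].conditions`), but all function names and the `fetch_nodes` callback (which would wrap something like `kubectl get nodes.longhorn.io -o json`) are hypothetical.

```python
import time

def condition_true(conditions, cond_type):
    """Return True if the condition of the given type has status "True"."""
    for cond in conditions or []:
        if cond.get("type") == cond_type:
            return cond.get("status") == "True"
    return False

def disk_is_ready(node, disk_name):
    """Steps 2 and 3: a disk counts as ready when Ready=True and
    Schedulable=True, unless the admin intentionally disabled scheduling
    at the node or disk level (then readiness validation is skipped)."""
    spec = node.get("spec", {})
    if not spec.get("allowScheduling", True):
        return True  # node-level allowScheduling: false -> skip validation
    disk_spec = spec.get("disks", {}).get(disk_name, {})
    if not disk_spec.get("allowScheduling", True):
        return True  # disk-level allowScheduling: false -> skip validation
    conds = (node.get("status", {}).get("diskStatus", {})
                 .get(disk_name, {}).get("conditions", []))
    return condition_true(conds, "Ready") and condition_true(conds, "Schedulable")

def all_disks_ready(nodes):
    """Step 2 across the cluster: every disk on every Longhorn node is ready."""
    return all(disk_is_ready(node, disk_name)
               for node in nodes
               for disk_name in node.get("spec", {}).get("disks", {}))

def wait_for_ready(fetch_nodes, expected_node_names, timeout=600, interval=10):
    """Steps 1 and 4: poll until all expected nodes (those annotated with a
    default-disk configuration) appear in the Longhorn Node CR list and all
    disks are ready; print a status summary on each retry."""
    deadline = time.time() + timeout
    missing, pending = set(expected_node_names), []
    while time.time() < deadline:
        nodes = fetch_nodes()  # e.g. parsed output of the Node CR list
        registered = {n["metadata"]["name"] for n in nodes}
        missing = set(expected_node_names) - registered
        pending = [(n["metadata"]["name"], d)
                   for n in nodes
                   for d in n.get("spec", {}).get("disks", {})
                   if not disk_is_ready(n, d)]
        if not missing and not pending:
            return True
        print(f"waiting: unregistered nodes={sorted(missing)}, "
              f"pending disks={pending}")
        time.sleep(interval)
    # Final summary for debugging if readiness was never achieved.
    print(f"pre-disk-check timed out; pending disks={pending}")
    return False
```

The key design point mirrored here is that `allowScheduling: false` short-circuits the readiness check, so maintenance-mode nodes or disks never block the upgrade.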

Related reference(s)

Test coverage

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

| Icon | Meaning | Available values |
|------|---------|------------------|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, openbao |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
  • ☁️ capm3 🚀 rke2 🐧 suse
  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 suse
  • ☁️ capm3 🚀 kadm 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun will make deployment pipelines run automatically, without human interaction.
  • Disabling allow failure will make deployment pipelines mandatory for pipeline success.
  • If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline until they complete.

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the run pipeline button in the Pipelines tab) or push new code.

Edited by Ravindra Tanwar
