Longhorn replica timeout settings are not consistent

Summary

For bare metal deployments, we set replicaReplenishmentWaitInterval=3600 to allow 1 hour for a node to come back and reuse its replicas (for example during a rolling update or after a temporary node failure).

However, we observe that the missing replicas are rebuilt 30 minutes after the node is gone.

This is due to the staleReplicaTimeout parameter of the Storage Class, which defaults to 30 minutes. This setting is not tunable in the Longhorn Helm chart for the default longhorn class.
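
To make the mismatch concrete, here is a minimal sketch of the behavior we observe. It is a simplified model, not Longhorn's actual logic: it assumes that a rebuild starts as soon as either timer expires, that replicaReplenishmentWaitInterval is expressed in seconds and that staleReplicaTimeout is expressed in minutes (as stated in the linked references). The function name is hypothetical.

```python
def effective_rebuild_delay_seconds(replenishment_wait_s: int,
                                    stale_replica_timeout_min: int) -> int:
    """Simplified model (assumption): the first timer to expire triggers the rebuild."""
    # staleReplicaTimeout is given in minutes; convert it to seconds
    stale_timeout_s = stale_replica_timeout_min * 60
    # Once the failed replica is considered stale it can no longer be reused,
    # so the shorter of the two delays wins in this model
    return min(replenishment_wait_s, stale_timeout_s)


if __name__ == "__main__":
    # Values from this issue: 3600 s wait interval, 30 min stale timeout
    print(effective_rebuild_delay_seconds(3600, 30))  # 1800
```

Under this model, the 1 hour wait only takes effect if staleReplicaTimeout is at least 60 minutes.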

Related references

https://longhorn.io/docs/1.8.1/references/settings/#replica-replenishment-wait-interval
https://longhorn.io/docs/1.8.1/references/storage-class-parameters/#stale-replica-timeout-field-parametersstalereplicatimeout