
Update read-only repository count metric to account for lazy failover

Sami Hiltunen requested to merge smh-unavailable-repos-metric into master

The read-only repository count metric previously reported the number of repositories whose copy on the primary was outdated. Praefect no longer promotes outdated replicas to primary, so the metric is no longer useful in that form. With lazy failover in place, Praefect fails over to an up-to-date replica as long as a healthy one is available. The purpose of the metric was to alert when a repository's availability was degraded, mainly when writes were blocked. With lazy failover, writes are no longer blocked, because Praefect simply promotes an up-to-date node. Praefect also hasn't served reads from outdated replicas since 7af9c950. If a repository has no healthy, fully up-to-date replica, it is fully unavailable, so there is effectively no read-only mode anymore. This MR updates the metric to count repositories that are unavailable according to the new failover logic. The old metric name is kept for now because some alerting depends on it.
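As a rough illustration of the rule described above, here is a minimal Go sketch that counts a repository as unavailable when none of its healthy replicas is fully up to date. This is not Praefect's actual implementation: the `replica` type, its fields, and the `isAvailable` and `countUnavailable` helpers are hypothetical names chosen for the example, and representing "up to date" as an integer generation compared across replicas is an assumption.

```go
package main

import "fmt"

// replica is a hypothetical, simplified view of one copy of a repository.
type replica struct {
	storage    string
	healthy    bool // assumption: the node hosting this replica is reachable
	generation int  // assumption: higher means more writes applied
}

// isAvailable reports whether at least one healthy replica is on the
// latest generation seen across all replicas, healthy or not.
func isAvailable(replicas []replica) bool {
	latest := -1
	for _, r := range replicas {
		if r.generation > latest {
			latest = r.generation
		}
	}
	for _, r := range replicas {
		if r.healthy && r.generation == latest {
			return true
		}
	}
	return false
}

// countUnavailable returns the number of repositories that have no
// healthy, fully up-to-date replica.
func countUnavailable(repos map[string][]replica) int {
	count := 0
	for _, replicas := range repos {
		if !isAvailable(replicas) {
			count++
		}
	}
	return count
}

func main() {
	repos := map[string][]replica{
		// Healthy, up-to-date copy exists: available.
		"repo-a": {{"gitaly-1", true, 3}, {"gitaly-2", true, 3}},
		// The only up-to-date copy is on an unhealthy node: unavailable.
		"repo-b": {{"gitaly-1", false, 5}, {"gitaly-2", true, 4}},
		// One healthy copy is outdated, but another healthy copy is
		// up to date, so a failover target exists: available.
		"repo-c": {{"gitaly-1", true, 1}, {"gitaly-2", true, 2}},
	}
	fmt.Println("unavailable repositories:", countUnavailable(repos))
}
```

Under this rule, an outdated copy only matters when no healthy up-to-date replica exists to fail over to, which is why `repo-c` is not counted.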

This MR also adds a gitaly_praefect_unavailable_repositories metric, which reports the same value under a more accurate name. This lets us migrate existing uses of the read-only metric to the new name and remove the old metric later.

Related to: #3207 (closed)
Documentation: gitlab!62704 (merged)
Depends on: !3543 (merged)

Edited by Sami Hiltunen
