
Missing metrics for Praefect replication queue depth

Summary

In gitlab-com/gl-infra/production#6133 (closed) we encountered a replication queuing alert that was very difficult to reason about because the queue depth is not exposed as a metric.

This is a corrective action for the incident. @pks-t commented on Slack:

it's really unfortunate we don't have this metric anymore (if we ever did, but I feel like we had it once). Might be a corrective action to add it back in, potentially as part of the datastore collector which regularly polls the database

Impact

When the alert fired, we spent a significant amount of time digging into the root cause of the delay without being able to tell how long the replication delay would last or what its impact would be.

Recommendation

Evaluate whether a metric can be added for the queue depth so that we can add it to our service dashboard.
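
As a starting point for that evaluation, here is a minimal sketch of the polling approach suggested in the Slack comment: count queued jobs per state and render the result in Prometheus text exposition format. It is an illustration, not Praefect's implementation. The metric name `praefect_replication_queue_depth`, the helper names, and the simplified `replication_queue` table with a single `state` column are all assumptions, and an in-memory SQLite database stands in for Praefect's PostgreSQL database so the example runs on its own.

```python
import sqlite3

# Hypothetical metric name; not an existing Praefect metric.
METRIC = "praefect_replication_queue_depth"


def queue_depth_by_state(conn):
    """Return {state: job count} for the (assumed) replication_queue table."""
    rows = conn.execute(
        "SELECT state, COUNT(*) FROM replication_queue "
        "GROUP BY state ORDER BY state"
    ).fetchall()
    return {state: count for state, count in rows}


def render_gauge(depths):
    """Render the counts as a gauge in Prometheus text exposition format."""
    lines = [f"# TYPE {METRIC} gauge"]
    for state, count in sorted(depths.items()):
        lines.append(f'{METRIC}{{state="{state}"}} {count}')
    return "\n".join(lines) + "\n"


def make_demo_db():
    """Build an in-memory stand-in database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE replication_queue ("
        "id INTEGER PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO replication_queue (state) VALUES (?)",
        [("ready",), ("ready",), ("in_progress",), ("failed",)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    # One polling cycle: query the queue, then print the gauge samples.
    print(render_gauge(queue_depth_by_state(make_demo_db())), end="")
```

In a real collector the same grouped count would run on a fixed interval, and it could plausibly be broken down further by virtual storage and target node so the dashboard shows which replica is falling behind.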

Verification

For verification, we can disable a Praefect node (in a test environment or staging) and check that the queue depth metric reflects the resulting replication backlog.
