Missing metrics for Praefect replication queue depth
Summary
In gitlab-com/gl-infra/production#6133 (closed) we encountered a replication queuing alert that was very difficult to reason about, because the queue depth was not available as a metric.
This is a corrective action for the incident. @pks-t commented on Slack:
it's really unfortunate we don't have this metric anymore (if we ever did, but I feel like we had it once). Might be a corrective action to add it back in, potentially as part of the datastore collector which regularly polls the database
Impact
When the alert fired, we spent a significant amount of time digging into the root cause of the delay, without being able to tell how long the replication delay would last.
Recommendation
Evaluate whether a metric can be added for the queue depth so that we can add it to our service dashboard.
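As a rough illustration of the polling approach suggested in the Slack comment, the sketch below shows how per-state queue counts read from the database could be rendered as a gauge in the Prometheus text exposition format. It is a minimal sketch under stated assumptions, not the Praefect implementation: the metric name, the label names, and the table and column names in the query are hypothetical and would need to be checked against the actual Praefect schema and the existing datastore collector.

```python
# Assumed query: the table name, JSON keys, and state values are hypothetical
# and must be verified against the real Praefect database schema.
QUEUE_DEPTH_QUERY = """
SELECT job->>'virtual_storage'     AS virtual_storage,
       job->>'target_node_storage' AS target_storage,
       state,
       COUNT(*)                    AS depth
FROM replication_queue
WHERE state IN ('ready', 'in_progress', 'failed')
GROUP BY 1, 2, 3
"""

# Hypothetical metric name, chosen only for this example.
METRIC_NAME = "praefect_replication_queue_depth"


def escape_label(value):
    """Escape a label value as required by the Prometheus text format."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_queue_depth(rows):
    """Render (virtual_storage, target_storage, state, depth) rows as gauge samples."""
    lines = [
        f"# HELP {METRIC_NAME} Number of jobs in the replication queue.",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    # Sort so the output is stable between scrapes.
    for virtual_storage, target_storage, state, depth in sorted(rows):
        labels = (
            f'virtual_storage="{escape_label(virtual_storage)}",'
            f'target_storage="{escape_label(target_storage)}",'
            f'state="{escape_label(state)}"'
        )
        lines.append(f"{METRIC_NAME}{{{labels}}} {int(depth)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample rows standing in for the result of QUEUE_DEPTH_QUERY.
    sample = [
        ("default", "gitaly-2", "ready", 7),
        ("default", "gitaly-3", "in_progress", 1),
    ]
    print(render_queue_depth(sample), end="")
```

Splitting the gauge by state and target storage would let the dashboard distinguish a backlog that is draining from one that is stuck on a single node, which is the distinction we could not make during the incident.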
Verification
For verification, we can disable a Praefect node (in a test environment or staging) and confirm that the queue depth metric reflects the resulting replication backlog.