Metrics to track replication lag for Geo
Dedicated has an FY27Q1 target of a 15-minute RPO. The current baseline and alerting requirements are here:
RPO Baseline Measurement and Alerting Requirements
Problem with current RPO metric
The definition of RPO is: PostgreSQL Database Replication Lag + Geo Log Cursor Processing Lag + Sidekiq Queuing Lag + Sidekiq Processing Lag
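As a minimal sketch of the definition above, the four components can be summed from one point-in-time reading. The class and field names below are hypothetical, chosen for illustration; they are not existing Geo metric names, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagSample:
    """One point-in-time reading of the four RPO components, in seconds.

    Field names are illustrative assumptions, not real Geo metric names.
    """

    pg_replication_lag: float
    log_cursor_lag: float
    sidekiq_queue_lag: float
    sidekiq_processing_lag: float

    def rpo_seconds(self) -> float:
        # RPO per the definition above: the plain sum of the four lags.
        return (
            self.pg_replication_lag
            + self.log_cursor_lag
            + self.sidekiq_queue_lag
            + self.sidekiq_processing_lag
        )


# Made-up values, only to show the arithmetic.
sample = LagSample(
    pg_replication_lag=30.0,
    log_cursor_lag=12.0,
    sidekiq_queue_lag=45.0,
    sidekiq_processing_lag=90.0,
)
print(sample.rpo_seconds())  # 177.0
```

The sum only makes sense when all four readings describe the same scope, which is the difficulty with a lag that is reported per Sidekiq pod.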
Defining a usable apdex from these 4 metrics does not work, mainly because the Sidekiq replication lag is measured per Sidekiq pod, which makes it practically impossible to use on our end. We delivered the best effort possible with what is available, but the resulting alerting is unusable: the calculation hits an unexpected division by zero, which breaks it in practice.
Access to the individual metrics is useful for the dashboard, but to alert we need a single combined metric calculated at the application level.
What we need
What we need is a single end-to-end metric per Geo replicated data type: https://docs.gitlab.com/administration/geo/replication/datatypes/
Per the current definition of RPO, the sum of the 4 metrics is correct: an event must replicate to the replica DB, then be picked up by the log cursor, then a job is queued for Sidekiq, and finally the job is processed (data copied). The sum of those lags is the overall RPO. This is valid in theory but breaks in practice. The problem is that the result is not an apdex but a saturation metric (ranging from 0 to the RPO limit, with 100% meaning saturated), and with it we hit the unexpected division by zero that breaks the definition.
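To illustrate the saturation framing, here is a rough sketch that expresses an end-to-end lag per data type as a fraction of the RPO limit. It assumes a single lag value per data type is already available, which is the metric this issue asks for; the data type keys and lag values are hypothetical placeholders, and only the 15-minute limit comes from the target stated above.

```python
RPO_LIMIT_SECONDS = 15 * 60  # 15-minute RPO target stated in this issue


def rpo_saturation(
    end_to_end_lag_seconds: float,
    limit_seconds: float = RPO_LIMIT_SECONDS,
) -> float:
    """Return lag as a fraction of the RPO limit (1.0 means saturated)."""
    if limit_seconds <= 0:
        raise ValueError("limit_seconds must be positive")
    if end_to_end_lag_seconds < 0:
        raise ValueError("lag cannot be negative")
    # The denominator is the fixed limit, not a value derived from samples,
    # so it cannot be zero in this sketch.
    return end_to_end_lag_seconds / limit_seconds


# Hypothetical data type keys and made-up lag values, in seconds.
lag_by_data_type = {
    "example_type_a": 177.0,
    "example_type_b": 900.0,
    "example_type_c": 1350.0,
}

for data_type, lag in lag_by_data_type.items():
    saturation = rpo_saturation(lag)
    status = "BREACH" if saturation >= 1.0 else "ok"
    print(f"{data_type}: {saturation:.0%} of RPO limit ({status})")
```

An alert on this value would fire at or above 100%, and the same per-type numbers could feed the dashboard alongside the existing individual metrics.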
Benefits of per data type lagging metrics
Geo can keep their existing metrics (no breaking changes) and add this new one per replicated data type, which unblocks us and will give us as accurate a measurement as is possible for our baseline.