
Metrics to track replication lag for Geo

Dedicated has an FY27Q1 target of a 15-minute RPO. The current baseline and alerting requirements are described here:

RPO Baseline Measurement and Alerting Requirements

Problem with current RPO metric 

👉 Summary: It is a very low-level metric that breaks in practice

The definition of RPO is: PostgreSQL Database Replication Lag + Geo Log Cursor Processing Lag + Sidekiq Queuing Lag + Sidekiq Processing Lag

Defining a usable apdex from these 4 metrics does not work, especially because the Sidekiq replication lag is measured per Sidekiq pod, which makes it basically impossible to use on our end. We delivered the best effort possible with what is available, but the resulting alerting is unusable: we hit an unexpected division by zero, which breaks the calculation in practice.

Having access to the individual metrics is useful for the dashboard, but to alert we need a single combined metric calculated at the app level.

What we need

👉 Summary: end-to-end replication lag metrics per data type

What we need is a single end-to-end metric per Geo replicated data type: https://docs.gitlab.com/administration/geo/replication/datatypes/

Per the current definition of RPO, the sum of the 4 metrics is correct: an event must replicate to the replica DB, then be picked up by the log cursor, then a job is queued for Sidekiq, then the job is processed (data copied). The sum of those lags is the overall RPO. In theory it's valid; it just breaks in practice. The problem is that it's not an apdex but a saturation metric (0 -> RPO limit, with 100% being saturated), and with it we encountered the unexpected divide by zero, which breaks the definition.
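To make the saturation idea concrete, here is a minimal sketch of summing the four lags and dividing by the RPO limit. The function name, the choice to collapse per-pod Sidekiq samples with `max`, and returning `None` when no samples exist are all assumptions for illustration. The issue does not say where the divide by zero actually occurred, so the guard below only shows one way an empty per-pod series could be handled.

```python
from typing import Optional, Sequence

RPO_LIMIT_SECONDS = 15 * 60  # 15-minute RPO target from this issue


def rpo_saturation(
    db_replication_lag: float,
    log_cursor_lag: float,
    sidekiq_queuing_lags: Sequence[float],     # one sample per Sidekiq pod
    sidekiq_processing_lags: Sequence[float],  # one sample per Sidekiq pod
    rpo_limit: float = RPO_LIMIT_SECONDS,
) -> Optional[float]:
    """Return (sum of the four lags) / rpo_limit; 1.0 means fully saturated.

    Returns None instead of raising when the value cannot be computed
    (no pod samples, or a non-positive limit). This guard is an assumption.
    """
    if rpo_limit <= 0:
        return None
    if not sidekiq_queuing_lags or not sidekiq_processing_lags:
        return None
    # Assumption: the worst pod represents the Sidekiq stage.
    total = (
        db_replication_lag
        + log_cursor_lag
        + max(sidekiq_queuing_lags)
        + max(sidekiq_processing_lags)
    )
    return total / rpo_limit


if __name__ == "__main__":
    # 30s + 60s + 120s + 240s = 450s of a 900s budget
    print(rpo_saturation(30, 60, [120, 90], [240, 10]))
```

Even with the guard, the result still depends on how the per-pod samples are combined, which is the part that is hard to get right outside the application.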

👉 It would be more direct and better for the application to report how far behind it is on any given data type. At the application level, it knows when an object was uploaded or a Git commit was pushed, and also when that data made it to the replica site; the delta is the recovery point, which can be compared to the RPO. It's far more direct, and if we have that per data type, we can build the RPO metric.
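As a rough sketch of what such an application-level metric could look like, the example below derives one lag value per data type from event timestamps. Everything here is hypothetical: the `ReplicationEvent` record, the data type labels, and the choice to define lag as the age of the oldest event not yet replicated are assumptions, not Geo's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ReplicationEvent:
    """Hypothetical record of one change on the primary site."""
    data_type: str                    # e.g. "git", "uploads" (illustrative labels)
    created_at: datetime              # when the change happened on the primary
    replicated_at: Optional[datetime] # None while still pending on the replica


def lag_by_data_type(
    events: Iterable[ReplicationEvent], now: datetime
) -> Dict[str, timedelta]:
    """Lag per data type = age of the oldest event not yet replicated.

    A data type with no pending events reports zero lag.
    """
    lags: Dict[str, timedelta] = {}
    for event in events:
        lags.setdefault(event.data_type, timedelta(0))
        if event.replicated_at is None:
            lags[event.data_type] = max(lags[event.data_type], now - event.created_at)
    return lags


def breaches(
    lags: Dict[str, timedelta], rpo: timedelta = timedelta(minutes=15)
) -> List[str]:
    """Data types whose current lag exceeds the RPO target."""
    return sorted(name for name, lag in lags.items() if lag > rpo)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    events = [
        ReplicationEvent("git", now - timedelta(minutes=20), None),
        ReplicationEvent("uploads", now - timedelta(minutes=30), now - timedelta(minutes=29)),
        ReplicationEvent("lfs", now - timedelta(minutes=5), None),
    ]
    lags = lag_by_data_type(events, now)
    print(lags, breaches(lags))
```

With one value per data type, alerting reduces to comparing each lag against the 15-minute target, and the worst data type gives the overall recovery point.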

Benefits of per-data-type lag metrics

Geo can keep their existing metrics (no breaking changes) and add this new one per replicated data type, which unblocks us and will give us as accurate a measurement as is possible for our baseline.

Edited by Lucie Zhao