
Postgres replication lag alert for logical slots + fix label matcher (postgres** -> patroni** for fqdn regexp)

Nikolay Samokhvalov requested to merge nik-alert-high-pg-lag-for-logical into master

Issue: gitlab-com/gl-infra/reliability#16867

The current alerts capture lag only for physical replication, so high lag in logical replication would go unnoticed. We already have the data needed to cover logical replication as well: since we always use replication slots, the information that pg_replication_slots exposes on the source (primary, publisher), already collected in the metric confirmed_flush_lsn_bytes, is what we need.
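As a rough illustration of the idea (not the alert expression from this MR, which is not shown here), lag for a logical slot can be expressed as the distance in bytes between the current WAL position on the source and the slot's confirmed_flush_lsn. The sketch below assumes LSNs in PostgreSQL's textual `hi/lo` hexadecimal form; the function names and the threshold value are hypothetical.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after it the low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL that the slot's consumer has not yet confirmed as flushed."""
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(confirmed_flush_lsn)


# Hypothetical threshold, chosen only for this example.
LAG_THRESHOLD_BYTES = 1 * 1024**3


def lag_too_large(current_wal_lsn: str, confirmed_flush_lsn: str) -> bool:
    """Return True when the slot is further behind than the threshold."""
    return slot_lag_bytes(current_wal_lsn, confirmed_flush_lsn) > LAG_THRESHOLD_BYTES


if __name__ == "__main__":
    lag = slot_lag_bytes("16/B374D848", "16/B3000000")
    print(f"lag: {lag} bytes, alert: {lag_too_large('16/B374D848', '16/B3000000')}")
```

Because the slot position is read on the source, this kind of check does not need any access to the subscriber.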

We already have two alerts for replication lag: PostgreSQL_ReplicationLagBytesTooLarge (lag measured in bytes) and PostgreSQL_ReplicationLagTooLarge (lag measured in seconds). This MR adds a third, so some optimization is probably needed here. Unlike the two existing alerts, which cover physical replication only, the new alert covers logical replication.

Additionally, the label matcher was changed in three existing alerts to use patroni-.*-2004-.* instead of postgres-.*-2004-.* in the fqdn regexp (two of these changes were later reverted, as noted). The affected alerts are:

  • PostgreSQL_WALGReplicationStopped (reverted to old "postgres*" since it's managed by Omnibus)
  • PostgreSQL_ReplicationLagTooLarge_DelayedReplica (reverted to old "postgres*" since it's managed by Omnibus)
  • PostgreSQL_ReplicationLagBytesTooLarge
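
To show why the prefix in the fqdn regexp matters, here is a small sketch that applies both patterns to example hostnames. The hostnames are hypothetical and only follow the shape implied by the patterns. Prometheus anchors regex matchers at both ends, which `re.fullmatch` approximates; Prometheus uses RE2 rather than Python's `re`, but for patterns this simple the two should behave the same.

```python
import re

OLD_PATTERN = r"postgres-.*-2004-.*"
NEW_PATTERN = r"patroni-.*-2004-.*"

# Hypothetical hostnames, invented for this example.
PATRONI_HOST = "patroni-main-2004-01-db.example.internal"
POSTGRES_HOST = "postgres-dr-2004-01-db.example.internal"


def matches(pattern: str, fqdn: str) -> bool:
    """Emulate an anchored regex label matcher such as fqdn=~"<pattern>"."""
    return re.fullmatch(pattern, fqdn) is not None


if __name__ == "__main__":
    for host in (PATRONI_HOST, POSTGRES_HOST):
        print(host, "old:", matches(OLD_PATTERN, host), "new:", matches(NEW_PATTERN, host))
```

With the old pattern, an alert would never select a host whose name starts with patroni-, so it could not fire for that host no matter how large the lag became.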
Edited by Nikolay Samokhvalov
