Replication lag on Postgres DR archive replica is over 3 hours and growing
Summary
The "archive-replica" Postgres database has abnormally high replication lag, and the lag is still growing.
Note that this Postgres instance receives its transaction data through WAL file shipping, not through streaming replication as the Patroni nodes do.
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-03-10
- 08:50 - Replication lag starts to climb above 10 minutes. Lag increases almost continuously, indicating that transactions are being generated faster than they are being replayed on this replica.
- 15:26 - PagerDuty alerted that replication lag exceeded 3 hours: https://gitlab.pagerduty.com/incidents/PW678P6
2020-03-11
- 01:31 - PagerDuty alert recovered
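As a rough sanity check on the timeline, the two data points from 2020-03-10 give an average growth rate for the lag. This is a minimal sketch under stated assumptions: the lag was about 10 minutes at 08:50 and about 3 hours at 15:26 (that is, the alert fired close to the moment the threshold was crossed), and lag is measured as wall-clock time since the last replayed transaction. The function name `average_lag_growth` is made up for illustration.

```python
from datetime import datetime


def average_lag_growth(t0, lag0_s, t1, lag1_s):
    """Average change in lag, in seconds of lag per second of wall-clock time."""
    elapsed_s = (t1 - t0).total_seconds()
    if elapsed_s <= 0:
        raise ValueError("t1 must be later than t0")
    return (lag1_s - lag0_s) / elapsed_s


# Assumed data points taken from the timeline (all times UTC).
start = datetime(2020, 3, 10, 8, 50)   # lag climbs above 10 minutes
alert = datetime(2020, 3, 10, 15, 26)  # alert: lag exceeds 3 hours

growth = average_lag_growth(start, 10 * 60, alert, 3 * 60 * 60)

# If lag grows by g seconds per second, the replica is replaying roughly
# (1 - g) seconds of primary activity per second of wall-clock time.
replay_fraction = 1 - growth

print(f"average lag growth: {growth:.2f} s/s")
print(f"approximate replay speed relative to the primary: {replay_fraction:.2f}")
```

Under these assumptions, a growth rate near 1 would mean replay had stalled entirely, while a rate between 0 and 1 means the replica is still replaying, just more slowly than the primary generates WAL.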
Resources
Graph of replication lag
PromQL: (pg_replication_lag > 600) and on(instance) (pg_replication_is_replica{type="postgres-archive"} == 1)
Graph of rate of change in replication lag
PromQL: deriv(pg_replication_lag{env="gprd", type="postgres-archive"}[5m]) < 1
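For readers less familiar with the second query: PromQL's `deriv()` estimates the per-second slope of a gauge by fitting a simple linear regression over the samples in the range window. The sketch below reproduces that idea in plain Python so the graph's values are easier to interpret. The sample data and the 15-second scrape interval are invented for illustration, and `linear_slope` is a hypothetical helper, not part of any Prometheus client.

```python
def linear_slope(samples):
    """Least-squares slope of (timestamp_s, value) pairs, in value units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance


# Hypothetical lag samples (seconds), scraped every 15 s over one minute,
# with lag growing by 0.5 s for every second of wall-clock time.
growing = [(0, 600.0), (15, 607.5), (30, 615.0), (45, 622.5), (60, 630.0)]

# Hypothetical samples for a replica that is catching up.
recovering = [(0, 600.0), (15, 570.0), (30, 540.0), (45, 510.0), (60, 480.0)]

slope_growing = linear_slope(growing)        # positive: falling further behind
slope_recovering = linear_slope(recovering)  # negative: lag is shrinking

print(slope_growing, slope_recovering)
```

Read this way, a positive value on the rate-of-change graph means the replica is falling further behind, zero means it is keeping pace, and a negative value means it is catching up.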
Edited by Devin Sylva