Geo: last-processed event ID is sometimes larger than maximum event ID
Summary
Intermittently seen while performing preflight on gprd.gitlab.com
The primary generates Geo::EventLog
entries, which are passed to the secondary via streaming replication. Because of transactions, entries may sometimes be generated out-of-order (i.e., event 2 is seen by the secondary, then event 1).
As the secondary processes each event, it updates Geo::EventLogState#event_id
with Geo::EventLog#id
For a short period of time, I observed that Geo::EventLog.maximum(:id)
as read from either prd
(the primary, so before streaming replication) or gprd
(the secondary, so after streaming replication) was a lower number than that seen in Geo::EventLogState#event_id
. This should be impossible.
There are multiple database servers on both the prd
and gprd
sides. The topology looks somewhat like:
* Azure
* postgres-A (Azure primary)
* postgres-B (Azure secondary)
* GCP
* postgres-A (Azure secondary, GCP primary)
* postgres-B (GCP secondary)
* postgres-C (GCP secondary)
* postgres-D (GCP secondary)
* postgres-C (Azure secondary)
* postgres-D (Azure secondary)
Database load-balancing is enabled and reads may happen against any random secondary. Writes always go to the primary.
Steps to reproduce
???
What is the current bug behavior?
???
What is the expected correct behavior?
???
Relevant logs and/or screenshots
My terminal session from the period. Hardcoded numbers were generated by running Geo::EventLog.maximum(:id)
on prd
and copying the number over manually from there.
irb(main):001:0> Geo::EventLogState.last_processed
=> #<Geo::EventLogState event_id: 69624863>
irb(main):002:0> 69622710 - 69624863
=> -2153
irb(main):003:0> 69625675 - Geo::EventLogState.last_processed.event_id
=> -41
irb(main):004:0> Geo::EventLogState.last_processed.id
=> 69625922
irb(main):005:0> Geo::EventLogState.last_processed.id
=> 69626329
irb(main):006:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
Host postgres-05-db-gprd.c.gitlab-production.internal came back online
=> 68
irb(main):007:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> -48
irb(main):008:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):009:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):010:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):011:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):012:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):013:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):014:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):015:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):016:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):017:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.id
=> 0
irb(main):018:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.event_id
=> 0
irb(main):019:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.event_id
=> 0
irb(main):020:0> Geo::EventLogState.count
=> 1
irb(main):021:0> Geo::EventLogState.count
=> 1
irb(main):022:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.event_id
=> 0
irb(main):023:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.event_id
=> 0
irb(main):024:0> Geo::EventLog.maximum(:id) - Geo::EventLogState.last_processed.event_id
=> 0
irb(main):025:0> Geo::EventLogState
=> Geo::EventLogState(event_id: integer)
Output of checks
This bug happens on GitLab.com
Possible fixes
Theory: replication proceeds at different speeds. Perhaps the log cursor was speaking to a fast postgresql server in GCP and processing events that it could see before the rails consoles in either Azure or GCP could.
It's a bit sketchy though. I don't believe it strongly.
If it is what's happening, does it have any further implications for coherency beyond scaring me a little on a Thursday evening?