Investigate possibility of delaying geo replication to prevent rewinding on failover.
What?
The replica postgres database in a geo instance can be configured to delay applying WAL segments by a fixed interval (recovery_min_apply_delay).
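For concreteness, a minimal sketch of what this could look like on the replica (the 1h value is purely illustrative, and on PostgreSQL 11 and earlier this lives in recovery.conf rather than postgresql.conf):

```
# Sketch only: recovery.conf on the geo replica (postgresql.conf from PG 12 onwards).
# Hold back WAL apply by at least this long. 1h is a placeholder; the real value
# would need to cover the window in which a new master uploads its timeline history.
recovery_min_apply_delay = '1h'
```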
It's possible that by doing this we can prevent archive recovery from failing after a lossy postgres failover, by delaying recovery long enough that the new master is likely to have already uploaded its divergent timeline (see the Why? section).
Eventually, once the data loss issue is fixed and forked timelines are no longer uploaded to GCS, we could remove this intentional lag.
Why?
After a lossy failover, we have observed archive recovery instances, including DR Geo (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7293), break replication because they follow the old master's timeline in GCS, which is then superseded by the new master's divergent timeline.
If this occurred in Geo, replication would stop, and we would have to either restore a snapshot or use pg_rewind, and then let replication resume along the divergent timeline. If we did this, even ignoring the likely manual burden, we would have to convince ourselves that there are no edge cases regarding the tracking database if we rewind, then diverge, the geo replica. Would any repos end up in an inconsistent state?
Note that the geo database can diverge in this way after a lossy failover even if we were to use streaming replication instead of archive recovery (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7293#note_254775483).
Is this even possible?
A major potential roadblock is that I'm not sure the recovery_min_apply_delay setting will help us at all. I can't remember whether the dr-delayed instance (an 8h delayed replica in production) suffers the same desynchronisation problems after a cluster failover. If it does, that proves this approach doesn't work.
I think the mechanism for recovery is that wal-e downloads the segments as soon as they're available, and postgres waits for the segments to reach the configured age before consuming them. If wal-e has already followed the wrong timeline, perhaps it's too late? I might have misunderstood all of this.
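For reference, here is roughly the recovery configuration I have in mind for an archive-recovery replica; the exact wal-e invocation and credentials handling are assumptions, not copied from our actual config:

```
# Sketch of recovery.conf on an archive-recovery replica (PG 11 and earlier).
standby_mode = 'on'

# The startup process invokes this to fetch each WAL segment from the archive
# as it needs it. Credentials handling (e.g. an envdir wrapper) is omitted here.
restore_command = 'wal-e wal-fetch "%f" "%p"'

# 'latest' makes recovery follow any timeline switch it finds in the archive's
# .history files.
recovery_target_timeline = 'latest'

# The delay is enforced when replaying commit records, i.e. after the segment
# containing them has already been fetched by restore_command.
recovery_min_apply_delay = '8h'
```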
@abrandl @Finotto can you set me straight on whether or not the dr-delayed instance has broken replication after a failover in the same way the dr-archive does, and whether my mental model of how recovery proceeds is correct?
Apparently geo monitors drift between HEAD (or some branches? not sure) in the replicated repos and the geo replica DB. If so, it's vulnerable to a race: geo fetches from the primary in response to a database event (https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/services/geo/repository_base_sync_service.rb#L93), but more commits are pushed to the primary just before the fetch happens. Those newer commits are fetched too, leaving the repository ahead of what the (delayed) replica database knows about, i.e. a discrepancy. This race can always occur, but intentional lag will exacerbate it. @dbalexandre would this be a pain for geo operators?
RFC @rnienaber @dbalexandre @ashmckenzie @devin
@rnienaber This isn't strictly needed for the staging rollout, but if you feel it belongs in another/no epic please move it.