Warning about the scale of un-replicated data in DR failover
When a disaster occurs and a secondary is promoted, there may be data that successfully saved to the primary but was not replicated to the secondary before the disaster occurred. This data should be treated as lost.
The administrator should be given a sense of the scale of the data loss.
@brodock describes this in https://gitlab.com/gitlab-org/gitlab-ee/issues/4209#note_49788859:
Let's consider this overall state:
primary last event id: 150
secondary last processed event id: 130

So when we promote the secondary, we are actually missing 20 events to replicate. If we save the missing delta somewhere, we can then reconcile the state when the old primary is online again.
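The delta described above is just the gap between the two event IDs. A minimal sketch (the method name and IDs are illustrative; in Geo the numbers would come from the primary's event log and the secondary's tracking database):

```ruby
# Sketch of computing the replication delta at promotion time.
# `replication_delta` is a hypothetical helper, not an existing Geo API.
def replication_delta(primary_last_event_id, secondary_last_processed_id)
  primary_last_event_id - secondary_last_processed_id
end

delta = replication_delta(150, 130)
puts "Missing #{delta} events to replicate"  # prints "Missing 20 events to replicate"
```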
When using hashed storage, we can be sure that the events there are safe to be re-processed afterwards.
The major issue is how to reconcile the database; if the database is too far behind, we lose whatever is there, and there is no easy way to "rebuild" it.
So, failing over is a destructive action from the point of view of non replicated data.
In the future, when we add checksums to repositories, we may be able to point out repositories that diverged (when we get the old primary back online, in a reconciliation state).
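The checksum idea could look something like the following sketch: each side keeps a checksum over its repository's refs, and comparing the two after the old primary comes back online flags divergence. This is an illustration only; the ref data and the `refs_checksum` helper are made up, not an existing GitLab feature.

```ruby
require 'digest'

# Hypothetical divergence check: hash the sorted ref list on each side
# and compare. Any difference in refs produces a different checksum.
def refs_checksum(refs)
  Digest::SHA256.hexdigest(refs.sort.join("\n"))
end

old_primary_refs = ["refs/heads/master abc123", "refs/heads/feature def456"]
new_primary_refs = ["refs/heads/master abc123"]

diverged = refs_checksum(old_primary_refs) != refs_checksum(new_primary_refs)
puts "repository diverged: #{diverged}"  # prints "repository diverged: true"
```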
The rationale here is that no one is going to use GitLab or git to process payments or do anything that requires a global ACK at the transaction level.
So:
1. Losing 10 minutes of data is better than a whole company being unable to work because the primary is down.
2. People have important code replicated on their machines anyway (that's the premise of git), so they can push to the repositories again whatever needs to be pushed when a new primary is promoted.
3. We may lose changes to issues/merge requests and comments (although this is mitigated as well, because people may receive email notifications with the changes; so if anything is utterly important, someone has a copy in their mailbox).
That's the trade-off of asynchronous vs. synchronous replication. We need to be clear with our customers about what they should expect, and about the implications of failing over in that state.
Proposal
Does the tracking database have information we can use to help the admin understand the number of events that will be 'lost' or very difficult to recover? I'm imagining feedback in the promote-secondary-to-primary task like:
At least 28 events have not been replicated from the primary.
Last known event: 112,342 (2017-11-29T12:21:24.123Z)
Last replicated event: 112,314 (2017-11-29T12:21:07.789Z)
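To make the proposal concrete, here is a sketch of how the promotion task could build that warning, assuming the tracking database can give us the last known primary event and the last replicated event with their timestamps. `LastEvent`, `unreplicated_warning`, and `with_commas` are hypothetical names for illustration:

```ruby
require 'time'

LastEvent = Struct.new(:id, :created_at)

# Format an integer with thousands separators, e.g. 112342 -> "112,342".
def with_commas(n)
  n.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Build the warning shown during promote-secondary-to-primary.
# The "At least" wording reflects that more events may have existed on the
# (now unreachable) primary than the secondary ever heard about.
def unreplicated_warning(last_known, last_replicated)
  missing = last_known.id - last_replicated.id
  <<~MSG
    At least #{missing} events have not been replicated from the primary.
    Last known event: #{with_commas(last_known.id)} (#{last_known.created_at.iso8601(3)})
    Last replicated event: #{with_commas(last_replicated.id)} (#{last_replicated.created_at.iso8601(3)})
  MSG
end

known      = LastEvent.new(112_342, Time.iso8601("2017-11-29T12:21:24.123Z"))
replicated = LastEvent.new(112_314, Time.iso8601("2017-11-29T12:21:07.789Z"))
puts unreplicated_warning(known, replicated)
```

Running this prints the example output from the proposal above (28 missing events between IDs 112,314 and 112,342).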