Geo DR: Making it work for GitLab.com scale

This is a complicated topic, but we're hoping to make Geo DR a tool that can work at GitLab.com scale. Ideally, this means it would be possible for GitLab to move between cloud providers with minimal downtime.

Assumptions

  1. We have to move all the data across the public Internet, which may be capped at a relatively slow 100 MB/s.
  2. We want to make this as transparent to the user as possible. People shouldn't notice a transition or failover.
  3. We have one Geo primary instance named Legacy and a secondary instance named ShinyCloud.
  4. We'd like to use as much Geo code as we can. We could use rsync or the Bacula backups to copy some data, but we still need Geo to help verify/ensure the data we have is current.

Problem

  1. We have 70-100 TB of data that has to be migrated as quickly as the network allows.
  2. Since this will take a significant amount of time (weeks or months; see the rough estimate after this list), we ideally want something that can continuously track what data has changed, sync it, and verify that it is correct.
  3. We want to reach a state where we can disable writes to Legacy, allow the secondary to catch up, and switch over to ShinyCloud with minimal downtime.
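
To make the weeks/months estimate concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: the inputs are the 70-100 TB data size and the bandwidth cap from the assumptions above, and because "100 MB/s" vs "100 Mbit/s" changes the answer by a factor of eight, both readings are shown.

```python
def transfer_days(size_tb, rate_mb_per_s):
    """Best-case transfer time, ignoring protocol overhead, retries, and
    the fact that the data keeps growing while we copy it."""
    total_bytes = size_tb * 1e12                  # decimal terabytes
    seconds = total_bytes / (rate_mb_per_s * 1e6)
    return seconds / 86400

for size_tb in (70, 100):
    for rate in (100, 12.5):                      # 100 MB/s vs 100 Mbit/s
        days = transfer_days(size_tb, rate)
        print(f"{size_tb} TB at {rate} MB/s ~ {days:.0f} days")
```

At a full 100 MB/s the transfer takes roughly 8-12 days; at an effective 100 Mbit/s it takes roughly 65-93 days. The 74-day figure quoted under "Issues to consider" matches roughly 80 TB at that lower rate.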

How Geo will work in 9.x

  1. ShinyCloud uses PostgreSQL replication to stream all updates from Legacy.
  2. ShinyCloud has a scheduler that attempts to clone/pull all repos and file attachments that it has not downloaded.
  3. Whenever a push or other change occurs on Legacy, it adds an entry to a Geo event log table. ShinyCloud sees this entry and acts on it (e.g. runs git pull); see the sketch after this list.
  4. There is a basic monitoring page at /admin/geo that shows the current number of repositories/files that have been synced.
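
As an illustration of step 3, here is a minimal sketch of how a secondary could consume such an event log by polling. The `geo_event_log` column layout, the polling approach, and the connection string are simplifying assumptions for illustration, not the actual Geo schema or code:

```python
import time
import psycopg2  # the secondary already has a read-only streamed copy of the primary DB

def process_event(event_type, project_path):
    # A real secondary would enqueue a background job here;
    # this sketch just prints what would be scheduled.
    if event_type == "repository_updated":
        print(f"schedule: git fetch for {project_path}")
    else:
        print(f"schedule: file download for {project_path}")

def consume_event_log(conn, last_id):
    """Fetch and act on event-log rows we have not handled yet."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, event_type, project_path FROM geo_event_log"
            " WHERE id > %s ORDER BY id",
            (last_id,),
        )
        for event_id, event_type, project_path in cur.fetchall():
            process_event(event_type, project_path)
            last_id = event_id
    return last_id

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=gitlabhq_geo")  # connection string is illustrative
    last_id = 0
    while True:
        last_id = consume_event_log(conn, last_id)
        time.sleep(5)  # poll interval is arbitrary
```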

Issues to consider

  1. Speed: Network bandwidth will really be a limiting factor. If Legacy can only send data at 100 MB/s, then in the best case it will take 74 days to transfer everything (see the transfer-time estimate above).
  2. Consistency checks: We need to verify that each individual repository and file is correct; one possible check is sketched after this list.
  3. System Hooks: The current implementation may not be a scalable solution for triggering updates because it causes lots of unnecessary work (e.g. duplicate git clones). How can we improve this?
    1. Douglas: Use backfilling instead of relying on system hooks: queue the event and schedule a backfill rather than doing a pull right away.
    2. Gabriel: With the new SystemHook that is 1:1 with a push, the "load" on the secondary would be similar to the load of processing GitTagPushService, but it's not 100% reliable, so we still need the backfilling to make sure we have everything we need on the other side.
    3. Stan: We may have to do away with system hooks entirely. Replace with audit or replication logs.
  4. Monitoring: How do we ensure progress is being made? We need to monitor everything: transfer rates, files/repos missing, etc.
  5. Setup/installation: Do we need to mimic the exact same storage paths on ShinyCloud as on Legacy? Will that be a problem?
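
For the consistency-check question (item 2), one possible approach, sketched below, is to compute a cheap digest on both sides and compare: a hash over the repository's refs for Git data, and a SHA-256 of the file contents for attachments. The commands and paths are illustrative assumptions, not the current Geo implementation:

```python
import hashlib
import subprocess

def repository_checksum(repo_path):
    """Digest over all refs in a repository; cheap to compute on both
    Legacy and ShinyCloud and compare. (Assumes `git` is on PATH.)"""
    refs = subprocess.run(
        ["git", "-C", repo_path, "show-ref", "--head"],
        capture_output=True, check=True, text=True,
    ).stdout
    return hashlib.sha256(refs.encode()).hexdigest()

def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of an attachment/LFS object, streamed to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (path is illustrative): compare the values computed on Legacy and
# ShinyCloud; any mismatch means the repository or file needs to be re-synced.
# repository_checksum("/var/opt/gitlab/git-data/repositories/group/project.git")
```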

What else do we need?

Where do we start?

Here's a strawman proposal:

  1. Set up a Geo secondary in another cloud that streams a read-only copy of the GitLab.com database.
  2. Add config options in Geo that allow admins to selectively choose what should be downloaded to the secondary (see the sketch after this list).
  3. Turn off system hooks to start.
  4. Start with replicating a few repositories.
  5. Once this appears to be working, stream uploads, other attachments, LFS, etc.
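
For step 2, the selective-download option could start as simple as an allowlist of namespaces that the secondary consults before scheduling any clone. The configuration and helper below are hypothetical, meant only to illustrate the shape of such a filter:

```python
# Hypothetical selective-sync filter: only projects under these namespaces
# are cloned to the secondary during the initial rollout.
ALLOWED_NAMESPACES = {"gitlab-org", "gitlab-com"}

def should_sync(project_full_path):
    """Return True if the project's top-level namespace is in the allowlist."""
    namespace = project_full_path.split("/", 1)[0]
    return namespace in ALLOWED_NAMESPACES

assert should_sync("gitlab-org/gitlab-ce")
assert not should_sync("some-user/private-project")
```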

/cc: @pcarranza, @northrup, @brodock, @regisF, @dbalexandre, @rspeicher
