Geo DR: Making it work for GitLab.com scale
This is a complicated topic, but we're hoping to make Geo DR a tool that can work at GitLab.com scale. Ideally, this means it would be possible for GitLab to move between cloud providers with minimal downtime.
- We have to move all the data across the public Internet, which may be capped at a slow 100 Mb/s (about 12.5 MB/s).
- We want to make this as transparent to the user as possible. People shouldn't notice a transition or failover.
- We have one Geo primary instance named `Legacy` and a secondary instance named `ShinyCloud`.
- We'd like to use as much Geo code as we can. We could use `rsync` or the Bacula backups to copy some data, but we still need Geo to help verify/ensure that the data we have is current.
- We have 70-100 TB of data that has to be migrated as quickly as the network allows.
- Since this will take a significant amount of time (weeks or months), we ideally want something that can continuously track what data has changed, sync it, and verify that it is correct.
- We want to reach a state where we can disable writes to `Legacy`, allow the secondary to catch up, and switch over to `ShinyCloud` with minimal downtime.
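A quick back-of-the-envelope check of the constraints above (a sketch: it assumes the bandwidth cap is 100 megabits per second, i.e. 12.5 MB/s, which is the only reading consistent with the 74-day estimate later in this document):

```python
LINK_MEGABITS_PER_S = 100                           # assumed cap across the public Internet
BYTES_PER_S = LINK_MEGABITS_PER_S * 1_000_000 / 8   # = 12.5 MB/s

def transfer_days(terabytes: float) -> float:
    """Best-case days to move the given volume, ignoring all overhead."""
    return terabytes * 1e12 / BYTES_PER_S / 86_400

for tb in (70, 80, 100):
    print(f"{tb} TB -> {transfer_days(tb):.0f} days")  # 65, 74, 93 days
```

Even the optimistic end of the 70-100 TB range is over two months of raw transfer, which is why continuous change tracking matters more than a one-shot copy.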
How Geo will work in 9.x
- `ShinyCloud` uses PostgreSQL replication to stream all updates from `Legacy`.
- `ShinyCloud` has a scheduler that attempts to clone/pull all repositories and file attachments that it has not yet downloaded.
- Whenever a push or some other change occurs in `Legacy`, it updates a Geo event log table.
  - `ShinyCloud` sees this entry and acts on it (e.g. runs a clone/pull of the repository that changed).
- There is a basic monitoring page at `/admin/geo` that shows the current number of repositories/files that have been synced.
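The event-log step above boils down to cursor-based consumption: remember the last event acted on, and schedule (rather than immediately perform) a sync for each newer one. An illustrative sketch only; Geo itself is Ruby, and the table/field names here (`GeoEvent`, `repository_updated`) are assumptions, not the real schema:

```python
from dataclasses import dataclass

@dataclass
class GeoEvent:
    id: int            # monotonically increasing primary key
    event_type: str    # e.g. "repository_updated" (assumed name)
    project_path: str

def process_events(events, last_processed_id):
    """Schedule a sync for each unseen event; return the new cursor."""
    to_sync = []
    for event in sorted(events, key=lambda e: e.id):
        if event.id <= last_processed_id:
            continue   # already handled before a restart
        if event.event_type == "repository_updated":
            # Queue the fetch for the scheduler instead of pulling inline,
            # so a flood of pushes cannot overwhelm the secondary.
            to_sync.append(event.project_path)
        last_processed_id = event.id
    return to_sync, last_processed_id

events = [
    GeoEvent(1, "repository_updated", "gitlab-org/gitlab-ce"),
    GeoEvent(2, "repository_updated", "gitlab-org/gitlab-ee"),
]
queue, cursor = process_events(events, last_processed_id=0)
print(queue, cursor)  # ['gitlab-org/gitlab-ce', 'gitlab-org/gitlab-ee'] 2
```

Persisting the cursor is what lets the secondary survive restarts without replaying (or losing) events.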
Issues to consider
- Speed: Network bandwidth will really be the limiting factor. If `Legacy` can only send data at 100 Mb/s (about 12.5 MB/s), then in the best case it will take roughly 74 days to transfer everything (80 TB at that rate).
- Consistency checks: We need to verify that each individual repository and each file is correct.
- System Hooks: The current implementation may not be a scalable solution for triggering updates because it causes lots of unnecessary work (e.g. duplicate `git clone`s). How can we improve this?
  - Douglas: Use backfilling to update instead of relying on system hooks: queue this event and schedule a backfill instead of doing a pull right away.
  - Gabriel: With the new SystemHook that is 1:1 with a push, the "load" on the secondary would be similar to the load of processing `GitTagPushService`, but it's not 100% reliable, so we still need the backfilling to make sure we have everything we need on the other side.
  - Stan: We may have to do away with system hooks entirely and replace them with audit or replication logs.
- Monitoring: How do we ensure progress is being made? We need to monitor everything: transfer rates, files/repos missing, etc.
- Setup/installation: Do we need to mimic exactly the same storage paths as on `Legacy`? Will that be a problem?
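One concrete shape for the consistency checks raised above: compare content digests computed independently on each side of the link, rather than trusting that a transfer completed. A minimal sketch (SHA-256 over raw bytes; whatever digest Geo ultimately uses may differ):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Digest computed independently on each side of the link."""
    return hashlib.sha256(data).hexdigest()

def in_sync(primary_bytes: bytes, secondary_bytes: bytes) -> bool:
    # Compare digests, not sizes or mtimes: a truncated or corrupted
    # transfer can still match on length and timestamp.
    return sha256_hex(primary_bytes) == sha256_hex(secondary_bytes)

print(in_sync(b"ref: refs/heads/master\n", b"ref: refs/heads/master\n"))  # True
print(in_sync(b"ref: refs/heads/master\n", b"ref: refs/heads/stale\n"))   # False
```

For repositories, the analogous check is comparing ref digests on both sides; the mismatch count per node is also exactly the kind of number the monitoring point above needs.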
What else do we need?
Where do we start?
Here's a strawman proposal:
- Set up a Geo secondary in another cloud that streams a read-only copy of the GitLab.com database.
- Add config options in Geo that allow admins to selectively check what should be downloaded to the secondary.
- Turn off system hooks to start.
- Start with replicating a few repositories.
- Once this appears to be working, stream uploads, other attachments, LFS, etc.
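The "selectively check what should be downloaded" option in step 2 could be as simple as a namespace allow-list consulted by the sync scheduler. The config shape below is an assumption for illustration, not Geo's actual settings:

```python
# Assumed config shape: an allow-list of namespaces to replicate first,
# matching the proposal's "start with replicating a few repositories".
ALLOWED_NAMESPACES = {"gitlab-org"}

def should_sync(project_path: str) -> bool:
    """Only projects whose top-level namespace is allow-listed get pulled."""
    namespace = project_path.split("/", 1)[0]
    return namespace in ALLOWED_NAMESPACES

print(should_sync("gitlab-org/gitlab-ce"))    # True
print(should_sync("some-user/side-project"))  # False
```

Widening the allow-list over time gives a natural ramp from "a few repositories" to full replication without changing the sync machinery itself.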