# Geo DR: Making it work for GitLab.com scale
This is a complicated topic, but we're hoping to make Geo DR a tool that can work at GitLab.com scale. Ideally, this means it would be possible for GitLab to move between cloud providers with minimal downtime.
## Assumptions
- We have to move all the data across the public Internet, where throughput may be capped at a slow 100 MB/s (see the calculation below).
- We want to make this as transparent to the user as possible. People shouldn't notice a transition or failover.
- We have one Geo primary instance named `Legacy` and a secondary instance named `ShinyCloud`.
- We'd like to use as much Geo code as we can. We could use `rsync` or the Bacula backups to copy some data, but we still need Geo to help verify/ensure that the data we have is current.
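As a quick sanity check on that bandwidth cap, here is a back-of-the-envelope calculation (assuming the ~100 TB upper bound from the Problem section and ignoring protocol overhead; the 15 MB/s figure is an assumed effective rate, not a measurement):

```python
SECONDS_PER_DAY = 86_400
TOTAL_BYTES = 100e12  # ~100 TB, the upper end of the estimate below

def days_to_transfer(rate_bytes_per_sec: float) -> float:
    """Days needed to move TOTAL_BYTES at a given sustained throughput."""
    return TOTAL_BYTES / rate_bytes_per_sec / SECONDS_PER_DAY

print(f"{days_to_transfer(100e6):.0f} days")  # 100 MB/s line rate -> ~12 days
print(f"{days_to_transfer(15e6):.0f} days")   # ~15 MB/s sustained -> ~77 days
```

Read against the 74-day estimate under "Issues to consider" below: that figure implies an effective sustained rate of roughly 15 MB/s over ~100 TB, which seems plausible once protocol overhead, many small objects, and link contention eat into the nominal 100 MB/s.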
## Problem
- We have 70-100 TB of data that has to be migrated as quickly as the network allows.
- Since this will take a significant amount of time (weeks/months), we ideally want something that can continuously track what data has changed, sync it, and verify that it is correct (a minimal sketch of such a loop follows this list).
- We want to reach a state where we can disable writes to `Legacy`, allow the secondary to catch up, and switch over to `ShinyCloud` with minimal downtime.
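To make "track, sync, verify" concrete, here is a minimal sketch of the loop we need, assuming a hypothetical registry table and using the remote HEAD as the consistency check. None of the table or function names here are actual Geo internals, and sqlite3 merely keeps the sketch self-contained:

```python
import sqlite3
import subprocess

# Hypothetical registry; the real Geo tracking tables are named differently.
db = sqlite3.connect("geo_registry.db")
db.execute("""CREATE TABLE IF NOT EXISTS repo_registry (
    path            TEXT PRIMARY KEY,   -- repository path on Legacy
    last_synced_sha TEXT,               -- HEAD we last verified locally
    state           TEXT                -- 'pending' | 'synced' | 'failed'
)""")

def remote_head(url: str) -> str:
    """Ask Legacy for its current HEAD without transferring any objects."""
    out = subprocess.run(["git", "ls-remote", url, "HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return out.split()[0]

def sync_and_verify(path: str, url: str, local_dir: str) -> None:
    """Fetch, then re-check against the primary before marking the repo synced."""
    subprocess.run(["git", "-C", local_dir, "fetch", url], check=True)
    local_sha = subprocess.run(["git", "-C", local_dir, "rev-parse", "FETCH_HEAD"],
                               capture_output=True, text=True,
                               check=True).stdout.strip()
    state = "synced" if local_sha == remote_head(url) else "failed"
    db.execute("INSERT OR REPLACE INTO repo_registry VALUES (?, ?, ?)",
               (path, local_sha, state))
    db.commit()
```

Comparing only HEAD is a weak check; a fuller pass would diff the complete ref list from `git ls-remote` and ideally checksum file attachments too. The shape stays the same either way: record what we believe we have, and keep re-verifying it against the primary.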
## How Geo will work in 9.x
- `ShinyCloud` uses PostgreSQL replication to stream all updates from `Legacy`.
- `ShinyCloud` has a scheduler that attempts to clone/pull all repos and file attachments that it has not yet downloaded.
- Whenever a push or some other change occurs on `Legacy`, it updates a Geo event log table. `ShinyCloud` sees this entry and acts on it (e.g. runs `git pull`; sketched below).
- There is a basic monitoring page at `/admin/geo` that shows the current number of repositories/files that have been synced.
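The event log mechanics might look roughly like the following. This is an illustrative schema and polling loop only: the real Geo table and job names differ, and sqlite3 again stands in for PostgreSQL so the sketch is self-contained:

```python
import sqlite3
import subprocess
import time

# Illustrative only: column names are not the actual Geo schema.
db = sqlite3.connect("geo_events.db")
db.execute("""CREATE TABLE IF NOT EXISTS geo_event_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT,   -- e.g. 'push'
    repo_path  TEXT    -- which repository the event touched
)""")

def consume_events(cursor_id: int) -> int:
    """ShinyCloud side: act on every event appended since cursor_id."""
    rows = db.execute("SELECT id, event_type, repo_path FROM geo_event_log "
                      "WHERE id > ? ORDER BY id", (cursor_id,)).fetchall()
    for event_id, event_type, repo_path in rows:
        if event_type == "push":
            subprocess.run(["git", "-C", repo_path, "pull"], check=False)
        cursor_id = event_id  # advance only after handling, so replay is safe
    return cursor_id

cursor = 0
while True:
    cursor = consume_events(cursor)
    time.sleep(5)  # poll; in reality the replicated table is read locally
```

Because the event log table itself rides on the PostgreSQL streaming replication from the first bullet, the secondary reads its own local replica; no extra channel between `Legacy` and `ShinyCloud` is needed.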
## Issues to consider
- Speed: Network bandwidth will be the limiting factor. If `Legacy` can only send data at 100 MB/s, then even in the best case we estimate it will take 74 days to transfer everything (see the calculation under Assumptions).
- Consistency checks: We need to verify that each individual repository and file is correct.
- System hooks: The current implementation may not be a scalable solution for triggering updates because it causes lots of unnecessary work (e.g. duplicate `git clone`s). How can we improve this? (See the coalescing sketch after this list.)
  - Douglas: Use backfilling to update instead of relying on system hooks; queue the event and schedule a backfill instead of doing a pull right away.
  - Gabriel: With the new SystemHook that is 1:1 with a push, the "load" on the secondary would be similar to the load of processing `GitTagPushService`, but it's not 100% reliable, so we still need the backfilling to make sure we have everything we need on the other side.
  - Stan: We may have to do away with system hooks entirely and replace them with audit or replication logs.
- Monitoring: How do we ensure progress is being made? We need to monitor everything: transfer rates, missing files/repos, etc.
- Setup/installation: Do we need to mimic the exact same storage paths on `ShinyCloud` as on `Legacy`? Will that be a problem?
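One way to implement Douglas's coalescing idea from the system-hooks item: the hook handler only marks a repository dirty, and a periodic backfill pass pulls each dirty repository once, however many pushes occurred. The in-process set below is a stand-in for a real queue (Sidekiq or similar); all names are illustrative:

```python
import subprocess
import threading

dirty_repos: set[str] = set()  # repos touched since the last backfill pass
lock = threading.Lock()

def on_system_hook(repo_path: str) -> None:
    """Hook handler does no git work at all: duplicate pushes to the
    same repo collapse into a single dirty entry."""
    with lock:
        dirty_repos.add(repo_path)

def backfill_pass() -> None:
    """Periodic job: pull each dirty repo exactly once, no matter how
    many hooks fired for it since the previous pass."""
    with lock:
        batch = sorted(dirty_repos)
        dirty_repos.clear()
    for repo in batch:
        subprocess.run(["git", "-C", repo, "pull"], check=False)
```

This also bears on Gabriel's reliability point: if hooks are dropped, a full scan that marks every repository dirty is the same `backfill_pass`, just with a larger batch, so nothing is permanently missed.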
## What else do we need?
## Where do we start?
Here's a strawman proposal:
- Set up a Geo secondary in another cloud that streams a read-only DB of GitLab.com.
- Add config options in Geo that allow admins to select what should be downloaded to the secondary (a hypothetical sketch follows this list).
- Turn off system hooks to start.
- Start with replicating a few repositories.
- Once this appears to be working, stream uploads, other attachments, LFS, etc.
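For the selective-download option above, the config could be as simple as an allowlist consulted before anything is scheduled. The setting names below are hypothetical, not existing Geo options:

```python
# Hypothetical selective-sync settings; Geo does not expose these names today.
SYNC_NAMESPACES = {"gitlab-org", "gitlab-com"}  # only replicate these groups
SYNC_OBJECT_TYPES = {"repository"}              # start small: no uploads/LFS yet

def should_sync(project_full_path: str, object_type: str) -> bool:
    """Decide whether the secondary should download this object at all."""
    namespace = project_full_path.split("/", 1)[0]
    return namespace in SYNC_NAMESPACES and object_type in SYNC_OBJECT_TYPES

assert should_sync("gitlab-org/gitlab-ce", "repository")
assert not should_sync("some-user/dotfiles", "repository")
assert not should_sync("gitlab-org/gitlab-ce", "lfs")
```

Widening `SYNC_OBJECT_TYPES` then maps onto the last bullet: once repositories look healthy, add uploads, other attachments, LFS, and so on.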
/cc: @pcarranza, @northrup, @brodock, @regisF, @dbalexandre, @rspeicher