Geo: Planned failover on staging.gitlab.com
Summary
Now that Geo is enabled on gitlab.com staging, we want to investigate what would be required to perform a planned failover on the staging environment. As DR business requirements are established as part of infrastructure's Q3 OKR, Geo will likely be one of the options considered.
Discussion in Slack from 2020-09-30:
joshua 4 hours ago Hey Geo team! Have we tested failover on Staging, if not, is there a plan to try?
joshua 3 hours ago @Nick Nguyen - do you know? (edited)
alexives 3 hours ago I don't believe that we have, and while I'm not aware of any plans that doesn't mean there aren't any, however at the moment the staging secondary may not be viable for DR since we utilize selective sync and only synchronize a small number of groups. :+1: 1
Nick Nguyen 3 hours ago Thanks @alexives. That’s my understanding as well
joshua 3 hours ago Thanks both! For context this is coming up in PM/Eng meeting
alexives 3 hours ago Staging is a large environment, but one of the reasons we use selective sync there is because our secondary is not an HA geo deployment. I don't think our current setup could handle a full sync.
joshua 3 hours ago Understood, maybe we could create a quick issue to plan? With the work on production likely moving forward soon, it would be good to discuss whether it makes sense to have a failover test on staging, and the corresponding cost.
joshua 3 hours ago cc @fabian
Nick Nguyen 2 hours ago @joshua I can open up a planning issue. Though I wasn’t aware production Geo was going to move forward. Is there an issue/doc with recent discussion about that effort?
joshua 2 hours ago @Andrew Thomas - can you link the best issue?
Chun Du 2 hours ago For context, deploying Geo on GitLab.com was postponed to FY22 in #202073 (comment 340683037) due to high cost. Would love to see what has been discussed recently. (edited)
Chun Du 2 hours ago BTW, it took ~7 months to deploy on staging &1908 (closed). (edited)
joshua 1 hour ago Yep. Infra has a Q3 OKR to establish a plan for DR, and provided it took awhile for Staging it makes sense to start getting the ball rolling for .com soon.
joshua 1 hour ago Can do something iterative like we did on staging, start with a small selective sync, then expand, etc. :+1: 2
Andrew Thomas 1 hour ago @joshua might be a bit premature to move forward with enabling geo failover on staging right now. I think we need to first establish the business requirements (which we should have by end of this quarter) and work backwards from those to identify the best implementation option (may or may not be geo)
joshua 23 minutes ago @Andrew Thomas from what I have heard, there seems to be greater appetite for Geo.
joshua 22 minutes ago May be worth connecting with Marin if you haven't since Friday
Considerations
We'll add to this list as the discussion continues, but here are a few that immediately come to mind:
- Selective sync is currently enabled on staging and we only synchronize a handful of groups (gitlab-org, gitlab-com, and some test groups)
- The staging primary uses a Patroni-managed PostgreSQL cluster, while the Geo secondary is a single-node site. Work is currently in progress to add experimental support for Patroni on a Geo secondary site.
- Geo does not replicate and verify 100% of GitLab data. The work we are doing toward the Disaster Recovery "viable" and "complete" maturity levels is closing the gap, but some unreplicated data will remain and will need separate consideration.
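
To make the considerations above easier to track as the list grows, here is a minimal sketch of how they could be expressed as a pre-failover checklist. It is only an illustration: `SecondarySiteStatus`, its fields, and `failover_blockers` are hypothetical names invented for this example and are not part of GitLab or the Geo API, and the unreplicated data type is a placeholder rather than a real finding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SecondarySiteStatus:
    """Hypothetical snapshot of a Geo secondary site (illustrative fields only)."""
    selective_sync_enabled: bool
    database_is_ha: bool
    unreplicated_data_types: List[str] = field(default_factory=list)


def failover_blockers(status: SecondarySiteStatus) -> List[str]:
    """Return the open considerations that would block a planned failover."""
    blockers = []
    if status.selective_sync_enabled:
        # Only a subset of groups exists on the secondary.
        blockers.append("selective sync is enabled: only some groups are replicated")
    if not status.database_is_ha:
        # The secondary runs a single database node rather than an HA cluster.
        blockers.append("secondary database is a single node, not an HA cluster")
    for data_type in status.unreplicated_data_types:
        blockers.append(f"not replicated by Geo: {data_type}")
    return blockers


# Staging as described in this issue; the data type name is a placeholder.
staging = SecondarySiteStatus(
    selective_sync_enabled=True,
    database_is_ha=False,
    unreplicated_data_types=["placeholder-data-type"],
)

for blocker in failover_blockers(staging):
    print("-", blocker)
```

An empty list from `failover_blockers` would mean none of the considerations listed so far applies. It would not by itself mean the site is ready for failover.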