Now that we've switched over to GCP, let's talk about the future of Geo at GitLab.
In a call with @sytses (https://www.youtube.com/watch?v=JQ3fUGs151I), we discussed this and think we should still be running Geo in another region at GitLab, for multiple reasons:
Geo has identified and fixed lots of bugs/issues within GitLab itself (e.g. missing uploads, failures during renames, database inconsistencies, etc.)
Running Geo at GitLab.com scale has revealed lots of bugs and performance issues with Geo itself
We were able to recover some lost data on GitLab.com
To save costs, we can probably scale down the Geo secondary fleet size significantly and use slower disks. We may even want to add a delay to the database replication to ensure Geo can function as a disaster recovery solution in case someone drops the database etc.
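To make the delayed-replication idea concrete, here is a minimal sketch in Python. It assumes a hypothetical fixed apply delay on the secondary and asks whether a destructive change (such as a dropped database) would be noticed before the secondary replays it. The names `can_stop_replay_in_time`, `apply_delay`, and `time_to_detect` are illustrative only and are not part of Geo or PostgreSQL, and the numbers are made up.

```python
from datetime import timedelta


def can_stop_replay_in_time(apply_delay: timedelta, time_to_detect: timedelta) -> bool:
    """Return True if the bad change is detected before the delayed secondary applies it.

    apply_delay: how far the secondary deliberately lags the primary (hypothetical value).
    time_to_detect: how long it takes someone to notice the destructive change.
    """
    return time_to_detect < apply_delay


# Example with assumed numbers: an 8-hour delay and a 30-minute detection time.
print(can_stop_replay_in_time(timedelta(hours=8), timedelta(minutes=30)))
```

The trade-off is that a longer delay widens the window for catching mistakes, but it also leaves the secondary further behind the primary if it ever has to take over.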
I think we need to loop security into this effort as well, since this would be an entire copy of non-redacted production data, which means it would need to be held to the same standard as customer data.
@jarv thank you for looping the security team in. I agree that this effort merits a security review as well.
Since we are talking about disaster recovery here, is it worth considering setting up Geo in an alternative cloud provider environment, such as AWS? We don't necessarily need to, but I haven't seen this question asked/considered yet.
I was thinking about this but it would add a bit of overhead to our provisioning setup and expand our attack surface quite a bit. I am going to start talking with @jurbanc when he returns about a security review and I think whatever actions come out of that would apply to this environment.
@dawsmith @glopezfernandez Do we have a timetable for when we might do this? We are currently flying a bit blind now that we don't have an actual Geo setup that is being used. At the very least, perhaps we should just run Geo with the Ops instance to start?
I propose eu-west-1 (Ireland). When Geo is mature enough to have this secondary be active, having a secondary in Europe gives GitLabbers outside of the US a closer instance to work with.
@MFarber Since this will be a complete copy of all production data, is there any regulation we need to consider regarding the physical location of customer data?
@stanhu and @dawsmith Who do I need to follow up with regarding the number/sizes of nodes, and other nodes/monitoring?
So the EU options are: London, Belgium, Frankfurt and the Netherlands. I believe the two with the best connectivity are Frankfurt and London (because of the financial industry).
I completely support pushing a secondary installation of gitlab.com to another country. Regarding feature completeness, any region in the EU will suffice based on what we require. So let's rock out London (europe-west2).
In my opinion, I would love to see us use Geo as a secondary hot datacenter from which we can serve traffic. Building this in an iterative fashion, I think it's fine to start off with a small fleet and build it up.
Regarding the capability to perform disaster recovery with our databases, I'd prefer to rely on the technologies allowed by postgres and supporting backup mechanisms that we are actively using/investigating.
@rnienaber US Fed. Gov't/DoD customers & subcontractors that use .com; this would be an area of research. I will figure out who on our federal accounts team to speak with. Thanks for looping us in so early in the process.
@rnienaber As per our conversation, I have added notes about control standards that would be met with the implementation of multi-region infrastructure. I will add more granular details as the plan progresses. Please advise if you have any questions or concerns. Thanks, as always, for including me!
@Finotto while the document you link is in an issue and epic, I believe the one mentioned in this issue (the same one @skarbek posted) has been fleshed out with ideas and commentary; maybe we can swap the doc in the issue and epic for the more complete document?
I really like where this is going, and reading the design has been very informative. I do feel we need to consider the overall picture and provide the data (observable availability) necessary to align all the stakeholders. DR is one area where the intersection between business requirements, engineering effort, and financial considerations requires careful balance. This will become more evident and pressing as the company grows and matures, given the investment DR requires.
We need a more structured framework to map out Infrastructure's DR needs (which are driven by the business), which will in turn, coupled with customer needs, drive requirements for Geo. Let's take a quick step back and do a quick roadmap of our DR strategy. Clearly, we have Geo in its current incarnation, but we need to provide structured input into where it needs to go.
How do we frame this conversation so that we can take the business input, outline financial considerations, and determine the engineering effort required to provide the appropriate DR solution? One potentially useful starting point is the 7 tiers of DR, which is an industry-wide definition of DR levels.
What are our capabilities today with regards to the 7-Tier model?
Where does Geo fit in that model?
Is there a delta?
Where should we be heading to tomorrow?
What are the RTO and RPO and what do those translate to in terms of engineering and financial investment?
We don't have to have exact answers today: estimates work as an initial iteration. With the model in hand, tune this design to define RTO/RPO goals and estimate costs within the model around our capabilities. This allows us to travel up and down the model so the business can understand what's possible at a given investment level, decide the required investment to meet our needs today and tomorrow, and drive requirements into Geo.
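One way to start filling in that model is a small table of options with estimated RTO, RPO, and cost, checked against business targets. The sketch below is in Python; every option name and number in it is a made-up placeholder, not a measurement, a tier definition, or a commitment.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrOption:
    name: str                # hypothetical label for a DR approach
    rto: timedelta           # estimated time to restore service
    rpo: timedelta           # estimated maximum data loss window
    monthly_cost_usd: int    # placeholder cost estimate


def options_meeting_targets(options, target_rto, target_rpo):
    """Return options whose estimated RTO and RPO are within the targets, cheapest first."""
    ok = [o for o in options if o.rto <= target_rto and o.rpo <= target_rpo]
    return sorted(ok, key=lambda o: o.monthly_cost_usd)


# Placeholder estimates only; real values would come from the business and from testing.
candidates = [
    DrOption("restore from backups", timedelta(hours=24), timedelta(hours=4), 1_000),
    DrOption("small Geo secondary", timedelta(hours=4), timedelta(minutes=15), 10_000),
    DrOption("full hot secondary", timedelta(minutes=30), timedelta(minutes=1), 50_000),
]

for option in options_meeting_targets(candidates, timedelta(hours=8), timedelta(hours=1)):
    print(option.name, option.monthly_cost_usd)
```

Changing the targets and re-running shows which options drop out, which is roughly what "travelling up and down the model" would look like with real estimates plugged in.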
Keep in mind that from a security perspective, disaster recovery does not always have the same objective as security recovery.
Disaster recovery's primary objective is to ensure business continuity.
Security recovery focuses on protecting data assets after a breach. Often, security recovery plans have to be more dynamic because there is an adversary (human element) involved. Therefore, response requirements for security recovery are less public (investigations, root cause analysis, evidence collection, etc.) than for disaster recovery after a weather-related phenomenon, for example.
From a tactical standpoint, disaster recovery is about fast and accurate data recovery, while security recovery is about implementing protective controls to prevent future loss of data.
@stanhu @rnienaber If we do decide to start a build-out of Geo for GitLab.com, it would be much nicer to do it at first for a selection of projects instead of the entire deployment. Is making selective sync work at GitLab.com scale something that is still on the radar of the Geo team?
After a discussion with @nick.thomas it looks like we would definitely need https://gitlab.com/gitlab-org/gitlab-ee/issues/4645 for this to happen. I think there may be some great benefits to selective sync as we could do something like enabling geo for new repositories or start out by enabling it for internal gitlab.com repositories without the expense and overhead of duplicating all of our storage.
Thanks @rnienaber! I think this would definitely be helpful for us, though not required of course if we are willing to replicate every single project in a large Geo deployment. If there isn't much work to fix the FDW issue, I think it is worth taking another look at it.
@dawsmith @jarv What are the next steps here? In Geo we have scheduled work for 11.6 to clear any security issues so that we can have a security review in (hopefully) 11.7. We are also scheduling anything else that seems to impact a rollout to GitLab.com. How do we go from the doc in the MR to a plan of action?
@rnienaber - we need to make a recommendation for the specific design of what is to be built - I was going to merge in @skarbek's high-level design and start an Epic where we can create some initial issues for coming up with specific details of what will be built and what strategy we are following for DR. That will be worked on in our next 2-week milestone.
next milestone:
Pick strategy and design out infra to be built
Start to talk to GCP about that plan - we'll need some quota increases in the region(s) we pick.
2 weeks away - next next milestone: start to build