Build a scalable, self-service geo replication and verification framework
Hand some data to an engineer and they will not replicate it. Teach an engineer how to replicate data and customer happiness shall be with you. - The Wise Geologist
Introduction
There have been several discussions within the Geo team regarding the limitations of our current replication system. As of October 2019, only ~50% of data types (we need a better name) are replicated, and of those only ~41% are fully verified. This is known, and we have made some efforts to change the situation by trying to replicate the remaining data types and to verify them. As part of those efforts we learned that replicating data types is hard, and so is verifying them.
There are several technical reasons for this, including the current architecture, the differences in data types, and the usage of FDW in combination with selective sync; however, these are not solely responsible for the difficulties. GitLab is growing rapidly and teams across the organisation are adding features at a rapid pace. Many of these features add new data types and are not initially designed to be easy to replicate; for example, GitLab Pages, server-side hooks, and Design Repositories. This is very likely due to a lack of knowledge and because Geo is not considered during initial designs. Engineers across the company are not empowered to easily support Geo replication, and consequently the Geo team becomes a bottleneck.
In order to address both the technical challenges and the operational limitations, I propose to build a new geo replication and verification framework with the explicit goal of enabling teams across GitLab to add new data types in a way that supports geo replication out of the box. It should be incredibly easy for engineers to do the right thing. The Geo team should develop the framework and offer support to the organisation but would no longer be responsible for most of the implementation.
Problem to solve
- Geo is usually not considered by other teams when implementing features
- Customers expect new features (and their data) to be replicated either for performance or Disaster Recovery purposes
- Adding new data types to Geo is hard and can only be performed by the Geo team
- Verification of data types is difficult and does not perform well. Again, only the Geo team can do this
- The company is growing rapidly and adds new features; the current operational mode is not scalable
- Software developers across GitLab are not empowered to make their features geo-compatible
Intended users
Further details
There are currently some technical considerations on how to iterate on Geo:
Proposal
- Create a Geo replication framework that is so easy to use that every software developer in GitLab can utilise it to make a new feature "Geo compatible"
- The framework should abstract away much of the low-level functionality, e.g. verification, so engineers don't need to worry about it
- Create educational material, workshops etc. to teach folks how to use it.
@toon wrote up some pseudocode of what this could look like:
class MyCoolNewFeatureModel < ApplicationRecord
  include Geo::Replicable

  geo_replicate_repository :cool_repository # name/prefix of the column(s) where Geo can find the repo

  # rest of the code unrelated to Geo
  # ...
end
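To make the idea a little more concrete, here is a minimal sketch of how such a mixin might work behind the scenes. Everything in it is an assumption for illustration only: the `registry`, `geo_replicables`, `geo_checksums` and `geo_verified_against?` names are hypothetical, and the model is a plain Ruby object rather than an `ApplicationRecord` so the snippet runs on its own. The point is that the feature author writes only the `include` line and the one-line declaration, while registration and verification live in the framework.

```ruby
require 'digest'

# Hypothetical sketch, not an existing Geo API.
module Geo
  module Replicable
    # Registry of every class that opted in, so the framework can
    # enumerate replicable data types without per-feature wiring.
    def self.registry
      @registry ||= []
    end

    def self.included(base)
      base.extend(ClassMethods)
      registry << base unless registry.include?(base)
    end

    module ClassMethods
      # Declares that the attribute `name` holds data Geo should replicate.
      def geo_replicate_repository(name)
        geo_replicables << name
      end

      def geo_replicables
        @geo_replicables ||= []
      end
    end

    # One checksum per declared attribute (assumed scheme: SHA-256 of the value).
    def geo_checksums
      self.class.geo_replicables.to_h do |name|
        [name, Digest::SHA256.hexdigest(public_send(name).to_s)]
      end
    end

    # A secondary's copy counts as verified when its checksums match the primary's.
    def geo_verified_against?(primary_checksums)
      geo_checksums == primary_checksums
    end
  end
end

# Stand-in for the model above; a real model would inherit from ApplicationRecord.
class MyCoolNewFeatureModel
  include Geo::Replicable

  geo_replicate_repository :cool_repository

  attr_reader :cool_repository

  def initialize(cool_repository)
    @cool_repository = cool_repository
  end
end
```

With something along these lines, the Geo team could iterate over `Geo::Replicable.registry` to discover all replicable data types, and verification would come for free with the declaration instead of being written per feature.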
Documentation
- We would need to create documentation for this. The better the documentation, the more likely this is to take off
Testing
- Geo replication must be extensively tested on all levels because of its relevance for Disaster Recovery
What does success look like, and how can we measure that?
- Percentage of new features that are geo replication and DR ready out of the box (target: 80% of new features implemented by GitLab support Geo replication and DR out of the box)
- Time it takes for a software developer to become productive using the Geo framework
- Number of steps needed to make a new data type geo-compatible
What is the type of buyer?
- Premium
- Ultimate