Exploring the DR flow on GitLab.com
Introduction
We are addressing a regional, catastrophic outage: the entire region is offline, or so degraded that it is unclear when service will recover. Our RTO targets cannot be met by simply waiting for the region to come back.
The following is based on a discussion with @brentnewton and @hphilipps to determine possible recovery flows.
Requirements
- RTO target for Premium+ customers: 1 hour.
- RPO target for Premium+ customers: 10 minutes.
- Free users are recovered on a best-effort basis, but we can't lose their data. A 1-hour RPO may be acceptable.
- The cost of the DR site must be controlled; it is largely driven by Free user Git data.
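The targets above could be encoded so that automation or dashboards can check observed replication lag against them. A minimal sketch (names and the structure are hypothetical; the figures come from the requirements above):

```python
from datetime import timedelta

# Recovery targets per customer tier, taken from the requirements above.
# The Free-tier RPO of 1 hour is tentative ("maybe acceptable").
TARGETS = {
    "premium_plus": {"rto": timedelta(hours=1), "rpo": timedelta(minutes=10)},
    "free": {"rto": None, "rpo": timedelta(hours=1)},  # RTO is best-effort
}

def meets_rpo(tier: str, replication_lag: timedelta) -> bool:
    """Return True if the observed replication lag is within the tier's RPO."""
    return replication_lag <= TARGETS[tier]["rpo"]

# Example: a 5-minute lag satisfies the Premium+ RPO; 15 minutes does not.
print(meets_rpo("premium_plus", timedelta(minutes=5)))   # True
print(meets_rpo("premium_plus", timedelta(minutes=15)))  # False
```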
DR site overview
We focused on four high-level components:
- Stateless services - API nodes, Web nodes, etc.
- PostgreSQL Database ("Patroni")
- Data in Object Storage
- Gitaly and Git data
```mermaid
graph LR
patroni1 -- "WAL shipping or streaming replication" --> patroni2
oo1 -- "Cross-region replication" --> oo2
git1 -- "???" --> git2
subgraph "gitlab.com"
stateless1[Stateless services]
patroni1[Database nodes]
oo1[Object storage buckets]
git1[Git data]
end
subgraph "dr.gitlab.com"
stateless2[Stateless services]
patroni2[Database nodes]
oo2[Object storage buckets]
git2[Git data]
end
```
- Stateless services can be scaled down on dr.gitlab.com
- Patroni nodes can be scaled down somewhat
- Git data is 95% from Free users; the means of replication is TBD
How to replicate Git data?
The main constraint on the above is that Free user Git data is the main cost driver. Git data is kept on SSDs, and replicating all of it (e.g. via Geo) would roughly double that cost. Below are some proposals to address this.
Git data replication - GCP disk snapshots only
We already take regular snapshots of the SSDs that store Git data. At its simplest, we could rely only on those snapshots. The maximum snapshot frequency is 10 minutes, which is barely within our RPO target.
```mermaid
graph LR
git1 -- "GCP disk snapshots - every 10 mins" --> git2
subgraph "gitlab.com"
git1[Git data]
end
subgraph "dr.gitlab.com"
git2[Git data]
end
```
- Cost of snapshots is not yet known
- No separation between Premium+ and Free
- Maximum data loss for Premium+ is 10 minutes
- We would need to restore all disk snapshots and bring up Gitaly nodes; this may exceed our RTO (needs testing)
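To make the RTO risk above concrete, a back-of-the-envelope restore-time model can be sketched. All figures here are illustrative assumptions, not measurements, and the parallel-batch model is deliberately simplistic:

```python
from datetime import timedelta

def snapshot_restore_time(num_disks: int,
                          per_disk_restore: timedelta,
                          parallelism: int) -> timedelta:
    """Rough estimate: disks are restored in parallel batches of `parallelism`."""
    batches = -(-num_disks // parallelism)  # ceiling division
    return batches * per_disk_restore

# Illustrative figures only: 600 Gitaly disks, 10 minutes per restore,
# 100 concurrent restore operations.
estimate = snapshot_restore_time(600, timedelta(minutes=10), 100)
print(estimate)                        # 1:00:00
print(estimate <= timedelta(hours=1))  # True - but leaves zero headroom
```

Even under generous assumptions the snapshot-only flow consumes the entire Premium+ RTO budget on Gitaly restores alone, which supports the "need to test" caveat.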
Git data replication - GCP disk snapshots for Free, continuous replication for Premium+
This approach would use GCP disk snapshots for Free user Git data, likely at an hourly cadence. We would separate Premium+ data onto dedicated Gitaly nodes, still snapshot those disks as a backup (e.g. every 10 minutes), but rely on some other, more continuous mechanism (e.g. Geo or another technology) to replicate the Git data.
```mermaid
graph LR
git1 -- "GCP disk snapshots - every hour" --> git2
git1 -- "Continuous replication" --> git3
subgraph "gitlab.com"
git1[Git data]
end
subgraph "dr.gitlab.com"
git2[Free user Git data]
git3[Premium+ Git data]
end
```
- Cost of GCP disk snapshots is not yet known
- Premium+ data would be as up to date as possible (ideally less than a minute behind), well within our RPO target. Maximum data loss is still 10 minutes via the backup snapshots
- Premium+ data would not need to be restored from snapshots, so service can resume faster
- Free data would need to be restored later on
- We need a way to restrict GitLab access to Premium+ users
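The separation this approach depends on could look like plan-aware shard routing: Premium+ repositories pinned to continuously replicated Gitaly storages, Free repositories on snapshot-only storages. A hypothetical sketch (storage names and the plan check are invented; GitLab does not have this routing today, per the product gaps below):

```python
# Hypothetical shard routing: Premium+ namespaces go to dedicated,
# continuously replicated Gitaly storages; Free namespaces go to
# snapshot-only storages. All names are invented for illustration.
PREMIUM_STORAGES = ["gitaly-premium-01", "gitaly-premium-02"]
FREE_STORAGES = ["gitaly-free-01", "gitaly-free-02", "gitaly-free-03"]

def pick_storage(namespace_id: int, plan: str) -> str:
    pool = PREMIUM_STORAGES if plan in ("premium", "ultimate") else FREE_STORAGES
    # Stable assignment: a namespace always maps to the same storage.
    return pool[namespace_id % len(pool)]

print(pick_storage(42, "premium"))  # gitaly-premium-01
print(pick_storage(42, "free"))     # gitaly-free-01
```

A real implementation would also have to handle plan upgrades/downgrades, which imply moving repositories between pools.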
Git data replication - Gitaly uses Object Storage (hypothetical)
If Gitaly had the capability to store data in Object Storage, this problem would essentially disappear: we could rely on cross-region bucket replication to replicate the data.
Known product gaps
- We have no way to restrict logins to Premium+ users only
- We can select storage shards for Geo, but AFAIK we don't have the ability to separate Premium+ Git data onto specific nodes and disks
- We have no automated promotion process for Geo
- We can't easily route traffic to another site
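The first gap - restricting logins to Premium+ users during failover - could eventually be a simple application-level switch. A hypothetical sketch of what that missing capability might look like (no such flag exists in GitLab today, which is exactly the gap):

```python
# Sketch of the missing "Premium+-only logins" capability as an
# application check. The flag and plan names are hypothetical.
DR_PREMIUM_ONLY_MODE = True  # would be toggled on during failover

def login_allowed(user_plan: str) -> bool:
    """During DR, only paid Premium+ tiers may log in."""
    if not DR_PREMIUM_ONLY_MODE:
        return True
    return user_plan in ("premium", "ultimate")

print(login_allowed("ultimate"))  # True
print(login_allowed("free"))      # False
```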
Testing DR procedures
@hphilipps had a few ideas:
- Ideally we could route a small percentage of users to a secondary site
- We can perform bubble tests to exercise the entire process (promote the DR site while it is isolated from the primary)
Open questions
- How do we replicate Redis? Do we need to care about Sidekiq queues?
- There are many small other DR steps that would need to be automated - what are they?
- How do we handle a situation in which the database is more up to date than the Git data for Free users?
- How do we handle a failback? Do we need to?
Failover flow overview
If we choose "Git data replication - GCP disk snapshots for Free, continuous replication for Premium+", this is how the entire DR process could look. I am assuming a full outage for now and am still omitting many details.
- Determine that a service outage has occurred and that recovery is not possible in place without violating RPO/RTO targets
- Inform all users about initiating DR procedures and what to expect
- Prevent the primary site from starting if at all possible (split brain prevention)
- Scale up all stateless services to a desired level
- Promote database cluster
- Scale up database cluster
- Determine that all Git data for Premium+ users is available (otherwise resort to disc snapshots)
- Lock out all Free users
- Reconfigure dr.gitlab.com to become read-writable
- Other DR steps, e.g. DNS repointing
- Allow Premium+ users to log in
- Service should be restored for Premium+ users; Free users are locked out <------ Must be within one hour
- Scale up Gitaly nodes and restore Free user Git data from snapshots
- Allow Free users to log in
- All services are operational
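The ordering above can be condensed into a two-phase runbook sketch, with the Free-user restore deliberately sequenced after Premium+ service is back, since only the first phase must finish within the one-hour RTO. Step names are hypothetical labels for the steps listed above:

```python
# Condensed sketch of the failover ordering above. Only PREMIUM_PHASE
# must complete within the one-hour RTO; FREE_PHASE follows afterwards.
PREMIUM_PHASE = [
    "confirm_outage",
    "notify_users",
    "fence_primary",             # split-brain prevention
    "scale_stateless_services",
    "promote_and_scale_database",
    "verify_premium_git_data",   # fall back to disk snapshots if this fails
    "lock_out_free_users",
    "make_dr_site_read_writable",
    "repoint_dns",
    "unlock_premium_users",
]
FREE_PHASE = [
    "restore_free_snapshots",
    "unlock_free_users",
]

def failover_plan(premium_git_data_ok: bool) -> list[str]:
    plan = list(PREMIUM_PHASE)
    if not premium_git_data_ok:
        # Continuous replication failed: restore Premium+ data from the
        # backup snapshots before unlocking Premium+ users.
        plan.insert(plan.index("unlock_premium_users"),
                    "restore_premium_snapshots")
    return plan + FREE_PHASE

print(failover_plan(True)[-1])  # unlock_free_users
```

Encoding the flow this way also gives a natural place to hang the automation and bubble-test hooks discussed earlier.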