[meta] Disaster Recovery
Disaster Recovery is Generally Available for single secondary configurations
Customers want a disaster recovery solution to prevent their organization from being severely impacted by a data center outage or some other major failure. We also want to be able to use such a solution for GitLab.com.
A key component of disaster recovery is making sure that data is replicated and current in another location that is accessible. GitLab Geo provides this foundation.
To offer a comprehensive disaster recovery solution, everything needs to be replicated and accessible. Not all of these are required for Disaster Recovery to be GA.
- git %10.2
- git LFS
  - object storage gitlab-org/gitlab-ee#3944 (replicated externally)
  - local (disk, NFS etc)
- wiki %10.2
- database (issues, merge requests, snippets etc)
- attachments (images on issues and merge requests)
  - object storage gitlab-org/gitlab-ee#3944 (replicated externally)
  - local (disk, NFS etc)
- CI logs and artifacts
  - object storage gitlab-org/gitlab-ee#3944
  - local (disk, NFS etc) gitlab-org/gitlab-ee#2388
- GitLab Pages assets (`.html`, `.css`, `.js` etc that will be served)
- ElasticSearch gitlab-org/gitlab-ee#1186
Proposal
We want to offer a Disaster Recovery solution that our customers will want to buy, and that we can also use ourselves for GitLab.com. GitLab.com is the biggest GitLab installation that we know of, and has its own constraints. However, we are confident that if we solve this problem for ourselves, our customers will benefit too: we will run into potential bugs before they do, which makes for a more solid product.
The feature will be called Disaster Recovery, once marketed.
- %10.5 Single-secondary GA &17 (closed)
- %10.7 Multi-secondary GA &65 (closed)
Enhancements
- Planned failover process migrating between data centers (like the GCP migration)
- Support Elasticsearch in Geo secondary nodes gitlab-org/gitlab-ee#1186
Implementation notes (geo related, not DR)
## Implementation approach

We tried the MinIO approach but realized it won't work for us for a variety of reasons. We are now investigating building our own solution.
- Every attachment is tracked in the primary node's DB.
- Secondary nodes have a new tracking DB.
- We periodically check the tracking DB and find the highest `updated_at` timestamp
- Find the first X timestamps in the primary node's DB that are later than this `updated_at`
- Replicate those files and update the secondary node's table once it's done
- Rinse and repeat.
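
The loop described above can be sketched roughly as follows. This is only an illustration, not the actual Geo implementation: the primary node's DB and the secondary's tracking DB are stood in for by an in-memory list and dict, and every name here (`Upload`, `highest_tracked_timestamp`, `next_batch`, `sync_once`, `batch_size`) is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Upload:
    """Hypothetical row describing one attachment tracked on the primary."""
    id: int
    path: str
    updated_at: datetime


def highest_tracked_timestamp(tracking: Dict[int, Upload]) -> Optional[datetime]:
    """Return the highest updated_at recorded in the tracking DB, if any."""
    return max((u.updated_at for u in tracking.values()), default=None)


def next_batch(primary: List[Upload], cursor: Optional[datetime],
               batch_size: int) -> List[Upload]:
    """Return the first `batch_size` primary rows newer than the cursor."""
    newer = [u for u in primary if cursor is None or u.updated_at > cursor]
    newer.sort(key=lambda u: (u.updated_at, u.id))
    return newer[:batch_size]


def sync_once(primary: List[Upload], tracking: Dict[int, Upload],
              replicate: Callable[[Upload], None], batch_size: int) -> int:
    """Run one pass: find the cursor, fetch a batch, replicate, record."""
    cursor = highest_tracked_timestamp(tracking)
    batch = next_batch(primary, cursor, batch_size)
    for upload in batch:
        replicate(upload)              # copy the file to the secondary
        tracking[upload.id] = upload   # record it only once the copy is done
    return len(batch)
```

Calling `sync_once` repeatedly until it returns 0 corresponds to the "rinse and repeat" step. One limitation of this sketch: because the cursor comparison is strictly "later than", rows that share the same `updated_at` could be skipped if a batch boundary falls between them, so a real implementation would likely need a tie-breaker such as the row ID.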
Previous releases
### Version-by-version breakdown

9.0
- Set up the tracking database
- Record file uploads in the database (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8893)
- Much improved repository backfill mechanism (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1197)
- Support for replicating LFS objects (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1237)
- Support for basic status monitoring for Geo Nodes (see https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1197)
- Update Omnibus with these new changes
9.1
- Make GitLab Geo easier to install for developers with GDK gitlab-development-kit!270 (merged)
- Make GitLab Geo installation process easy and automated (https://gitlab.com/gitlab-org/gitlab-ee/issues/1664)
- Add support for remaining file replication (e.g. attachments, etc.) https://gitlab.com/gitlab-org/gitlab-ee/issues/1955
9.2
- Improve UX on Geo Nodes screen https://gitlab.com/gitlab-org/gitlab-ee/issues/1975
- Resync repositories that have been updated recently https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1826
9.3
- Add new Geo event logs for project deletions and renames
- Add push events to Geo event log: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1976
9.4 (July)
- Support PostgreSQL replication slots
- Improve speed of cloning and replication (e.g. by using more parallel workers)
- Enable Geo log cursor
9.5 (August)
- Unhide all refs from GitLab: https://gitlab.com/gitlab-org/gitlab-ee/issues/2959
- Detach repository group and path name from disk: https://gitlab.com/gitlab-org/gitlab-ce/issues/28283
- Deprecate system hooks: https://gitlab.com/gitlab-org/gitlab-ee/issues/2174#note_28319238
- Group-level selective replication: https://gitlab.com/gitlab-org/gitlab-ee/issues/2224#note_33117061
10.0 (September)
- Start testing Geo with GitLab.com: https://gitlab.com/gitlab-com/infrastructure/issues/2293, https://gitlab.com/gitlab-org/gitlab-ee/issues/1884 @jarv
- Remove Geo system hooks: https://gitlab.com/gitlab-org/gitlab-ee/issues/3110 @to1ne
- Instrument all project/file download times: https://gitlab.com/gitlab-org/gitlab-ee/issues/3020 @stanhu
- Implement migration path from legacy to hash-based storage format: https://gitlab.com/gitlab-org/gitlab-ee/issues/3118 @brodock
10.1 (October)