Validate and document taking a backup and restoring on a live GitLab instance
Problem to solve
GitLab has a number of different options for backing up data, each with its own advantages and disadvantages. A general theme we have noticed across all solutions is that backups take increasingly longer to perform as the customer's data grows.
The documentation could provide better guidance on which solution to use for different scenarios and how to configure it appropriately.
Some customers would like to take multiple backups a day. To start with, we will set a goal of reaching an RTO of 6 hours and an RPO of 3 hours.
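As an illustration of what a 3-hour RPO implies operationally, a backup would need to be taken at least every 3 hours, for example via cron on the Omnibus node. This is only a sketch: the `STRATEGY` and scheduling choices below are assumptions, not a recommended configuration.

```shell
# Hypothetical crontab entry on the Omnibus node: run a backup every 3 hours.
# CRON=1 suppresses progress output; STRATEGY=copy avoids "file changed as we
# read it" errors on a live instance at the cost of extra disk space.
0 */3 * * * /opt/gitlab/bin/gitlab-backup create CRON=1 STRATEGY=copy
```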
Proposal
We want to provide customers with a simple set of options that map to their setup, with clear instructions on how to set up and configure each backup solution. We want to consolidate all the content around backup and restore into three simple options for customers.
The options will be delineated based on the volume of GitLab data the customer wants to back up and the reference architecture the customer's deployment is based on.
1K - 2K reference architecture with < 100GB GitLab data (Omnibus/Hybrid)
- Backup rake task on an omnibus node
Setup and operation of this are mostly well documented. We will just need to organise the content so it's easy to find.
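For reference, a full backup and restore with the rake task looks roughly like the following. The `BACKUP` value is a placeholder for the timestamp prefix of an archive in the backup directory; exact invocations depend on the GitLab version.

```shell
# Create a full backup on the Omnibus node (GitLab 12.2 and later).
sudo gitlab-backup create

# Restore a specific backup; <timestamp> is the prefix of the backup archive,
# e.g. 1700000000_2023_11_14_16.5.0 for 1700000000_2023_11_14_16.5.0_gitlab_backup.tar.
sudo gitlab-backup restore BACKUP=<timestamp>
```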
1K - 2K reference architecture with > 100GB GitLab data (Omnibus/Hybrid)
Due to the architecture and the volume of data we will need to split this into two operations. Both omnibus and hybrid architectures have omnibus nodes from which the backup rake task can be run.
- Backup rake task (DB, Repo) on omnibus node
- Object storage replication using the provider's replication services.
We need to validate and document this approach.
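A sketch of how these two operations might be run. The bucket names are placeholders, and the `SKIP` list is an assumption that varies by which data types a deployment keeps in object storage.

```shell
# 1. Back up only the database and repositories with the rake task,
#    skipping components that live in object storage.
sudo gitlab-backup create SKIP=uploads,builds,artifacts,lfs,registry,packages

# 2. Replicate object storage out-of-band. The provider's native replication
#    (e.g. S3 cross-region replication) is preferable for continuous copies;
#    a one-off sync can be done from the CLI:
aws s3 sync s3://<source-bucket> s3://<replica-bucket>        # AWS
gsutil -m rsync -r gs://<source-bucket> gs://<replica-bucket> # GCP
```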
3K - 25K reference architecture (Omnibus/Hybrid)
Due to the architecture and the volume of data we will need to split this into three operations.
- Incremental backups on omnibus node
- Backup rake task (DB) on omnibus node
- Object storage replication using the provider's replication services.
We need to validate and document this approach.
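A sketch of the three operations for this tier. Incremental repository backups require a recent GitLab version (roughly 14.9 onwards), and the identifiers below are placeholders.

```shell
# 1. Incremental repository backup, layered on top of a previous full backup.
sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<previous-timestamp>

# 2. Database-only backup: skip everything except the database.
sudo gitlab-backup create SKIP=repositories,uploads,builds,artifacts,lfs,registry,packages

# 3. Object storage replication via the provider's tooling, e.g.:
aws s3 sync s3://<source-bucket> s3://<replica-bucket>
```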
In all scenarios, if a managed DB is in use, we recommend customers use the provider's DB backup tools to back up and restore the DB.
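For example, on AWS RDS a manual snapshot can be taken and restored with the CLI. The instance and snapshot identifiers are placeholders.

```shell
# Take a manual snapshot of the GitLab database instance.
aws rds create-db-snapshot \
  --db-instance-identifier <gitlab-db-instance> \
  --db-snapshot-identifier gitlab-backup-$(date +%Y%m%d%H%M)

# Restore by creating a new instance from the snapshot.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier <restored-instance> \
  --db-snapshot-identifier <snapshot-id>
```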
If we are not able to obtain a consistent backup, or the inconsistencies cannot be healed easily, we should consider exploring the option of taking the backup from a Geo secondary. This should only be considered as a fallback for 3K+ architectures, as the infrastructure overhead will not make sense for the smaller architectures.
Acceptance criteria
- It should be possible to take a backup and restore it in less than 6 hours (RTO).
- The restored backup should be fully operational, with no faults that cannot be repaired with simple fixes. If simple fixes are required, they should be clearly documented.
- Documentation should make it simple for the customer to identify which category their setup falls into.
- We should achieve an RPO of 3 hours.
- Documentation should cover instructions for at least the two major object storage vendors (AWS and GCP).
Considerations
- How can we, and subsequently the customer, validate a restored backup? We should document this, as it will be useful for customers as well.
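As a starting point for validation, GitLab ships with rake-based health checks that can be run against the restored instance. This is a partial list of checks, not a validated procedure.

```shell
# Basic application health and configuration checks.
sudo gitlab-rake gitlab:check SANITIZE=true

# Verify that encrypted values in the database are decryptable with the
# restored gitlab-secrets.json (catches a common restore fault).
sudo gitlab-rake gitlab:doctor:secrets

# Check the integrity of all Git repositories on the instance.
sudo gitlab-rake gitlab:git:fsck
```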
Review
Product and/or engineering management, as well as technical IC stakeholders from each group below, should review this proposal and offer their insights before we take any action.
Group | Eng stakeholder | PM/EM stakeholder |
---|---|---|
groupgeo | @mkozono | @sranasinghe @juan-silva |
groupdistribution | @plu8 | |
groupgitaly | @proglottis | @jcaigitlab |
Quality | @grantyoung | |
Solutions Architects | | |