Validate and document taking a backup and restoring on a live GitLab instance
Problem to solve
GitLab has a number of different options for backing up data, each with its own advantages and disadvantages. A general theme we have noticed across all solutions is that backups take increasingly longer to perform as the customer's data grows.
The documentation could provide better guidance on which solution to use for different scenarios and how to configure it appropriately.
Some customers would like to take multiple backups a day. To start with, we will set a goal of reaching an RTO of 6 hours and an RPO of 3 hours.
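As an illustration of what a 3-hour RPO implies operationally, a backup would need to be taken at least every 3 hours, for example via cron on the Omnibus node. This is only a sketch: the `STRATEGY` and scheduling choices below are assumptions, not a recommended configuration.

```shell
# Hypothetical crontab entry on the Omnibus node: run a backup every 3 hours.
# CRON=1 suppresses progress output; STRATEGY=copy avoids "file changed as we
# read it" errors on a live instance at the cost of extra disk space.
0 */3 * * * /opt/gitlab/bin/gitlab-backup create CRON=1 STRATEGY=copy
```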
Proposal
We want to provide customers with a simple set of options that map to their setup, with clear instructions on how to set up and configure each backup solution. We want to consolidate all the content around backup and restore into three simple options for customers.
The options will be delineated based on the volume of GitLab data the customer wants to back up and the reference architecture the customer's deployment is based on.
1K - 2K reference architecture with < 100GB GitLab data (Omnibus/Hybrid)
- Backup rake task on an omnibus node
Setup and operation of this are mostly well documented. We will just need to organise the content so it's easy to find.
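For reference, a full backup and restore with the rake task looks roughly like the following. The `BACKUP` value is a placeholder for the timestamp prefix of an archive in the backup directory; exact invocations depend on the GitLab version.

```shell
# Create a full backup on the Omnibus node (GitLab 12.2 and later).
sudo gitlab-backup create

# Restore a specific backup; <timestamp> is the prefix of the backup archive,
# e.g. 1700000000_2023_11_14_16.5.0 for 1700000000_2023_11_14_16.5.0_gitlab_backup.tar.
sudo gitlab-backup restore BACKUP=<timestamp>
```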
1K - 2K reference architecture with > 100GB GitLab data (Omnibus/Hybrid)
Due to the architecture and the volume of data we will need to split this into two operations. Both omnibus and hybrid architectures have omnibus nodes from which the backup rake task can be run.
- Backup rake task (DB, Repo) on omnibus node
- Object storage replication using the provider's replication services.
We need to validate and document this approach.
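A sketch of how these two operations might be run. The bucket names are placeholders, and the `SKIP` list is an assumption that varies by which data types a deployment keeps in object storage.

```shell
# 1. Back up only the database and repositories with the rake task,
#    skipping components that live in object storage.
sudo gitlab-backup create SKIP=uploads,builds,artifacts,lfs,registry,packages

# 2. Replicate object storage out-of-band. The provider's native replication
#    (e.g. S3 cross-region replication) is preferable for continuous copies;
#    a one-off sync can be done from the CLI:
aws s3 sync s3://<source-bucket> s3://<replica-bucket>        # AWS
gsutil -m rsync -r gs://<source-bucket> gs://<replica-bucket> # GCP
```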
3K - 25K reference architecture (Omnibus/Hybrid)
Due to the architecture and the volume of data we will need to split this into three operations.
- Incremental backups on omnibus node
- Backup rake task (DB) on omnibus node
- Object storage replication using the provider's replication services.
We need to validate and document this approach.
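A sketch of the three operations for this tier. Incremental repository backups require a recent GitLab version (roughly 14.9 onwards), and the identifiers below are placeholders.

```shell
# 1. Incremental repository backup, layered on top of a previous full backup.
sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<previous-timestamp>

# 2. Database-only backup: skip everything except the database.
sudo gitlab-backup create SKIP=repositories,uploads,builds,artifacts,lfs,registry,packages

# 3. Object storage replication via the provider's tooling, e.g.:
aws s3 sync s3://<source-bucket> s3://<replica-bucket>
```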
In all scenarios, if a managed DB is in use, we recommend customers use the provider's DB backup tools to back up and restore the DB.
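For example, on AWS RDS a manual snapshot can be taken and restored with the CLI. The instance and snapshot identifiers are placeholders.

```shell
# Take a manual snapshot of the GitLab database instance.
aws rds create-db-snapshot \
  --db-instance-identifier <gitlab-db-instance> \
  --db-snapshot-identifier gitlab-backup-$(date +%Y%m%d%H%M)

# Restore by creating a new instance from the snapshot.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier <restored-instance> \
  --db-snapshot-identifier <snapshot-id>
```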
If we are not able to obtain a consistent backup, or the inconsistencies cannot be healed easily, we should consider exploring the option of taking the backup from a Geo secondary. This should only be considered as a fallback for 3K+ architectures, as the infrastructure overhead will not make sense for the smaller architectures.
Acceptance criteria
- It should be possible to take a backup and restore it in less than 6 hours (RTO).
- The restored backup should be fully operational, with no faults that cannot be repaired with simple fixes. If simple fixes are required, they should be clearly documented.
- Documentation should make it simple for the customer to identify which category their setup falls into.
- We should achieve an RPO of 3 hours.
- Documentation should cover instructions for at least the two major object storage vendors (AWS and GCP).
Considerations
- How can we, and subsequently the customer, validate a restored backup? We should document this, as it will be useful for customers as well.
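As a starting point for validation, GitLab ships with rake-based health checks that can be run against the restored instance. This is a partial list of checks, not a validated procedure.

```shell
# Basic application health and configuration checks.
sudo gitlab-rake gitlab:check SANITIZE=true

# Verify that encrypted values in the database are decryptable with the
# restored gitlab-secrets.json (catches a common restore fault).
sudo gitlab-rake gitlab:doctor:secrets

# Check the integrity of all Git repositories on the instance.
sudo gitlab-rake gitlab:git:fsck
```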
Review
Product and/or engineering management, as well as technical IC stakeholders from each group below, should review this proposal and offer their insights before we take any action.
Group | Eng stakeholder | PM/EM stakeholder |
---|---|---|
groupgeo | @mkozono | @sranasinghe @juan-silva |
groupdistribution | @plu8 | |
groupgitaly | @proglottis | @jcaigitlab |
Quality | @grantyoung | |
Solutions Architects | | |