Commit b55fdc56 authored by Shreya Shah's avatar Shreya Shah Committed by Marin Jankovski

Update DR/Backups handbook pages

parent 8c65623a
+2 −1
@@ -716,6 +716,7 @@
[Controlled-Documents]
/content/handbook/business-technology/entapps-documentation/policies/gitlab-business-continuity-plan.md @brobins @james.shen @NabithaRao @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/gitlab-com/policies/monitoring/ @marin @sabrinafarmer @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/gitlab-com/policies/backup/ @marin @andrashorvath @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/database/_index.md @marin @sabrinafarmer @vincywilson @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/database/disaster-recovery.md @marin @andrashorvath @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/production/_index.md @marin @sabrinafarmer @gitlab-com/egroup @gitlab-com/content-sites
+2 −0
@@ -13,6 +13,8 @@ toc_hide: true
This document is a work-in-progress and proposes architecture changes for the GitLab.com SaaS.
The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.

For the current state see [Disaster Recovery Policies for GitLab Backups](/handbook/engineering/gitlab-com/policies/backup/#disaster-recovery).

- A **zonal recovery** is required when all resources are unavailable in one of the three availability zones in `us-east1` or `us-central1`.
- A **regional recovery** is required when all resources become unavailable in one of the regions critical to operation of GitLab.com, either `us-east1` or `us-central1`.

+82 −0
---
title: "Backups of GitLab.com"
description: "This policy specifies requirements for backups of GitLab.com"
controlled_document: true
---

## Purpose

This policy outlines how GitLab performs, monitors, and validates backups and restorations of GitLab.com.
These procedures are critical for ensuring data recovery and disaster recovery for customer data.

## Scope

GitLab.com's backup strategy includes both monitoring and restore validation.

Customer data is stored in the following locations:

1. All PostgreSQL databases for GitLab.com
1. Object storage for GitLab.com, including packages, LFS, uploads, and CI data
1. The CustomersDot database, which manages subscriptions and purchases
1. Git repositories

## Roles & Responsibilities

| Role                      | Responsibility                                                                 |
|---------------------------|--------------------------------------------------------------------------------|
| GitLab Team Members       | Ensure adherence to the requirements outlined in this policy                   |
| Engineering (Code Owners) | Approve significant changes and exceptions to this policy                      |

## Procedure

GitLab defines:

- The services requiring backups
- The frequency of backups, data retention periods, and restoration processes
- The procedures for data restoration for Disaster Recovery scenarios

### Backup and Restore

#### PostgreSQL Databases

GitLab.com database backups are taken every 24 hours, with incremental updates every 60 seconds. The data is securely streamed to [GCS](https://cloud.google.com/storage), encrypted, and retained for 90 days. The CustomersDot database is backed up daily with a retention period of 7 days.

All databases are continuously monitored to ensure successful backups, with alerts triggered for missing backups. Restore processes are validated through regular restoration from disk snapshots and replaying WAL segments.
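The missing-backup alerting described above amounts to a freshness check. As a minimal sketch (not GitLab's actual monitoring code; the two-hour alert slack is an assumption for illustration):

```python
from datetime import datetime, timedelta, timezone

# Policy: a full backup every 24 hours. The extra slack before an
# alert fires is an illustrative assumption, not a documented value.
BACKUP_INTERVAL = timedelta(hours=24)
ALERT_SLACK = timedelta(hours=2)

def backup_is_missing(last_backup_at: datetime) -> bool:
    """Return True when the newest backup is older than the expected
    interval plus slack, i.e. a missing-backup alert should fire."""
    now = datetime.now(timezone.utc)
    return now - last_backup_at > BACKUP_INTERVAL + ALERT_SLACK

# A backup taken 30 hours ago exceeds 24h + 2h and should alert.
last = datetime.now(timezone.utc) - timedelta(hours=30)
print(backup_is_missing(last))  # True
```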

#### Git Repositories

Repositories are backed up hourly using block-level disk snapshots. These snapshots are stored in multi-region object storage and retained for 14 days. All disks are monitored, with alerts triggered for missing snapshots.

Restore validation is conducted by randomly sampling disks and restoring recent snapshots.
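The sampling step can be sketched as follows; the helper name, disk names, and sample size are hypothetical, and the real validation is driven by GitLab's runbooks and tooling:

```python
import random

def pick_disks_for_validation(disks: list[str], sample_size: int = 3) -> list[str]:
    """Randomly choose disks whose most recent snapshot will be
    restored and verified (hypothetical helper, for illustration)."""
    return random.sample(disks, min(sample_size, len(disks)))

# Illustrative disk inventory; real names would come from the cloud API.
disks = [f"gitaly-disk-{i:02d}" for i in range(20)]
print(pick_disks_for_validation(disks))
```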

#### Object Storage

Data stored in Object Storage (GCS) benefits from Google's [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region bucket redundancy. To enhance data protection, [Object Versioning](https://cloud.google.com/storage/docs/object-versioning) and [Soft Delete](https://cloud.google.com/storage/docs/soft-delete) are enabled.

Automated restore validation is not required for Object Storage due to its inherent protections through versioning and soft delete.

### Disaster Recovery

For disaster recovery, backups are validated through periodic restoration exercises called "Game Days" to ensure compliance with recovery time objective (RTO) and recovery point objective (RPO) targets.
GitLab.com is deployed in the `us-east1` region across multiple GCP availability zones.
For short-term outages affecting a single zone within `us-east1`, unaffected zones will scale to restore service.
For the Gitaly service, recovery from backups will be necessary if data loss occurs.

Disaster recovery operations adhere to the [Disaster Recovery runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/disaster-recovery).
These procedures target specific services to allow parallelized recovery efforts.

Mock disaster recovery (DR) events are conducted quarterly to simulate incidents affecting one or more services.
These exercises validate DR processes and readiness for real incidents.

During these [Game Days](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/gameday.md), RTO and RPO targets are validated by [recording measurements for each procedure](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/recovery-measurements.md).

## Exceptions

Exceptions to this policy will be managed in accordance with the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).

## References

- [Information Security Policy](/handbook/security)
- [Records Retention & Disposal](/handbook/security/records-retention-deletion/)
- [Disaster Recovery runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/disaster-recovery)
- [GameDays](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/gameday.md)
+1 −116
@@ -3,119 +3,4 @@ title: "Database: Disaster Recovery"
controlled_document: true
---

## Purpose

This page contains an overview of the disaster recovery strategy we have
in place for the PostgreSQL database. In this context, a disaster means
losing the main database cluster or parts of it (a `DROP DATABASE`-type
incident).

This overview is not complete and will be extended.

We base our strategy on PostgreSQL's [Point-in-Time Recovery (PITR)](https://www.postgresql.org/docs/9.6/static/continuous-archiving.html) feature.

This means we're shipping daily snapshots and transaction logs (WAL) to
external storage (the archive). Given a snapshot, we can replay the WAL
until a certain point in time is reached (for example, right before the
disaster struck).

Currently, AWS S3 serves as a storage backend for the PITR archive.

## Scope

This handbook page applies to recovery of the GitLab PostgreSQL production database in a disaster scenario.

## Roles & Responsibilities

| Role | Responsibility|
| ---- | ------ |
| Infrastructure Team | Responsible for executing recovery of the production gitlab.com database in the event of a disaster |
| Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |

## Procedure

### Restore testing

A backup is only worth something if it can be successfully restored in a
certain amount of time. In order to monitor the state of backups and
measure the expected recovery time (`DB-DR-TTR`), we employ a daily
process to test the backups.

This process is implemented as a CI pipeline (see
[README.md](https://gitlab.com/gitlab-restore/postgres-gprd/blob/master/README.md)
for details). On a daily schedule, a fresh GCE database instance is
created, restores from the latest backup, and is configured as an
archive replica that recovers from the WAL archive (essentially
performing PITR). Once this completes, the restored database is
verified.
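The shape of such a scheduled pipeline might look roughly like this; the helper scripts are hypothetical placeholders, and the real definition lives in the `gitlab-restore/postgres-gprd` project:

```yaml
# Illustrative .gitlab-ci.yml sketch of a scheduled restore test.
restore-test:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - ./create-restore-instance.sh   # hypothetical: provision a fresh GCE instance
    - ./perform-pitr.sh              # hypothetical: restore base backup, replay WAL
    - ./verify-database.sh           # hypothetical: sanity-check the restored data
```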

There is monitoring in place to detect problems with the restore
pipeline (currently using [deadmanssnitch.com](https://deadmanssnitch.com)).
We plan to monitor the time it takes to recover and other metrics soon.

### Disaster Recovery Replicas

The backup strategy above is a *cold backup*. In order to restore from a
cold backup, we need to retrieve the full backup from a cold storage
(via network) and perform PITR from it. This can take quite some time
considering the amount of data needed to be put on the network.

The current speed of restoring a cold backup from AWS S3 is about 380
GB per hour (net size) for retrieving the base backup. With a database
size of currently 2.1 TB, retrieving the base backup alone already
takes more than 5 hours. The PITR phase is generally slower still.
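The arithmetic on the stated figures checks out:

```python
# Stated figures: 2.1 TB database, ~380 GB/hour retrieval from S3.
db_size_gb = 2100
restore_rate_gb_per_hr = 380

base_backup_hours = db_size_gb / restore_rate_gb_per_hr
print(f"{base_backup_hours:.1f} hours")  # 5.5 hours, before WAL replay even starts
```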

We currently aim at a `DB-DR-TTR` of 8 hours to recover from a backup.
We're *not there yet*, so as an interim measure we introduce disaster
recovery replicas.

#### Delayed Replica

Another option is to have a replica in place that always lags a few
hours behind the production cluster. We call this a *delayed replica*: it
is a normal streaming replica, but delayed by a few hours. In case
disaster strikes, it can be used to quickly perform PITR from the WAL
archive. This is much faster than a full restore, because we don't have
to retrieve a full backup from S3. Additionally, with daily snapshots,
the latest snapshot is in the worst case 24 hours old (plus the time it
took to capture it). A delayed replica is constantly kept at a certain
offset from the production cluster and hence does not need to replay
too many hours' worth of data.

* Production host: `postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-delayed`
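In stock PostgreSQL, a delayed streaming replica of this kind is configured with `recovery_min_apply_delay`. A sketch for the 9.6-era `recovery.conf`, where the delay value and connection details are illustrative assumptions rather than production's actual settings:

```ini
# recovery.conf on the delayed replica (PostgreSQL 9.6-era syntax)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.internal user=replication'  # illustrative
recovery_min_apply_delay = '8h'  # illustrative delay, not the production setting
```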

#### Archive Replica

Another type of replica is an *archive replica*. Its sole purpose is to
continuously recover from the WAL archive and hence *test* the WAL
archive. This is necessary because PITR relies on a continuous sequence
of WAL that can be applied to a snapshot of the database (basebackup).
If that sequence is broken for whatever reason, PITR can only recover
up to that point and no further. We monitor the replication lag of the
archive replica. If it falls too far behind, there is likely a problem
with the WAL archive.

The restore testing pipeline also performs PITR from the WAL archive and
would thus also detect (some) problems with the archive. However, an
archive replica that stays close to the production cluster detects
problems with the archive much faster than a daily backup test does.
Also, the archive replica has to consume all WAL from the archive,
whereas a backup restore is likely to read only a portion of the
archive to recover to a certain point in time.

In that sense, the archive replica, the delayed replica, and the
restore testing overlap in functionality. Together they give us high
confidence in our cold backup and PITR recovery strategy.

* Production host: `postgres-dr-archive-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-archive`

## Exceptions

Exceptions to this procedure will be tracked as per the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).

## References

* Parent Policy: [Information Security Policy](/handbook/security/)
* [Controlled Document Procedure](/handbook/security/controlled-document-procedure/)
Moved to [Disaster Recovery Policies for GitLab Backups](/handbook/engineering/gitlab-com/policies/backup/#disaster-recovery)
+2 −37
@@ -129,44 +129,9 @@ For some incidents, we may figure out that the usage patterns that led to the is
1. The definition of abuse can be found on the [security abuse operations section of the handbook](/handbook/security/)
1. In the event of an incident affecting GitLab.com availability, the SRE team may take action immediately to keep the system available. However, the team must also immediately involve our security abuse team. A new [security on call rotation](/handbook/security/security-operations/sirt/engaging-security-on-call/) has been established in PagerDuty: a Security Responder rotation can be alerted along with a Security Manager rotation.

## Backups
## Backup and Restore

### Purpose

This section is part of a [controlled document](/handbook/security/controlled-document-procedure/) covering our controls for backups. It covers BCD-11 in [the controls](/handbook/security/security-assurance/security-compliance/guidance/business-continuity-and-disaster-recovery/).

### Scope

Production database backups

### Roles & Responsibilities

| Role  | Responsibility |
|-----------|-----------|
| Infrastructure Team | Responsible for configuration and management |
| Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |

### Procedure

Backups of our production databases are taken every 24 hours, with continuous incremental data (at 60-second intervals) streamed into [GCS](https://cloud.google.com/storage). These backups are encrypted and follow this lifecycle:

- Initial 7 days in [Multi-regional](https://cloud.google.com/storage/docs/storage-classes#standard) storage class.
- After 7 days migrated to [Coldline](https://cloud.google.com/storage/docs/storage-classes#coldline) storage class.
- After 90 days, backups are deleted.
- Snapshots of non-Patroni-managed database (e.g. PostgreSQL DR replicas) and non-database (e.g. Gitaly, Redis, Prometheus) data filesystems are taken every hour and kept for at least 7 days.
- Snapshots of Patroni-managed databases (a designated replica, in fact) are taken every 6 hours and kept for 7 days.
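The first three lifecycle steps map onto a GCS bucket lifecycle configuration along these lines (a sketch of the JSON shape, not the literal production policy):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 7}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
```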

Data stored in Object Storage (GCS), such as artifacts, the container registry, and others, has no additional backups, relying on the [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region buckets.

For details see the runbooks, particularly [GCP snapshots](https://gitlab.com/gitlab-com/runbooks/blob/master/docs/uncategorized/gcp-snapshots.md) and [Database backups using WAL-E/WAL-G (encrypted)](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md).

### Exceptions

Exceptions to this backup policy will be tracked in the [compliance issue tracker](https://gitlab.com/gitlab-com/gl-security/security-assurance/team-commercial-compliance/compliance/-/issues/).

### References

- Parent Policy: [Information Security Policy](/handbook/security/)
See policies for [Backup and Restore](/handbook/engineering/gitlab-com/policies/backup).

## Patching

+1 −1

Contains only whitespace changes.
