Commit b55fdc56 authored by Shreya Shah's avatar Shreya Shah Committed by Marin Jankovski

Update DR/Backups handbook pages

parent 8c65623a
+2 −1
@@ -716,6 +716,7 @@
[Controlled-Documents]
/content/handbook/business-technology/entapps-documentation/policies/gitlab-business-continuity-plan.md @brobins @james.shen @NabithaRao @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/gitlab-com/policies/monitoring/ @marin @sabrinafarmer @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/gitlab-com/policies/backup/ @marin @andrashorvath @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/database/_index.md @marin @sabrinafarmer @vincywilson @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/database/disaster-recovery.md @marin @andrashorvath @gitlab-com/egroup @gitlab-com/content-sites
/content/handbook/engineering/infrastructure/production/_index.md @marin @sabrinafarmer @gitlab-com/egroup @gitlab-com/content-sites
+2 −0
@@ -13,6 +13,8 @@ toc_hide: true
This document is a work-in-progress and proposes architecture changes for the GitLab.com SaaS.
The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.

For the current state see [Disaster Recovery Policies for GitLab Backups](/handbook/engineering/gitlab-com/policies/backup/#disaster-recovery).

- A **zonal recovery** is required when all resources are unavailable in one of the three availability zones in `us-east1` or `us-central1`.
- A **regional recovery** is required when all resources become unavailable in one of the regions critical to operation of GitLab.com, either `us-east1` or `us-central1`.

+82 −0
---
title: "Backups of GitLab.com"
description: "This policy specifies requirements for backups of GitLab.com"
controlled_document: true
---

## Purpose

This policy outlines how GitLab performs, monitors, and validates backups and restorations of GitLab.com.
These procedures are critical for ensuring data recovery and disaster recovery for customer data.

## Scope

GitLab.com's backup strategy includes both monitoring and restore validation.

Customer data is stored in the following locations:

1. All PostgreSQL databases for GitLab.com
1. Object storage for GitLab.com, including packages, LFS, uploads, and CI data
1. The CustomersDot database, which manages subscriptions and purchases
1. Git repositories

## Roles & Responsibilities

| Role                      | Responsibility                                                                 |
|---------------------------|--------------------------------------------------------------------------------|
| GitLab Team Members       | Ensure adherence to the requirements outlined in this policy                   |
| Engineering (Code Owners) | Approve significant changes and exceptions to this policy                      |

## Procedure

GitLab defines:

- The services requiring backups
- The frequency of backups, data retention periods, and restoration processes
- The procedures for data restoration for Disaster Recovery scenarios

### Backup and Restore

#### PostgreSQL Databases

GitLab.com database backups are taken every 24 hours, with incremental updates every 60 seconds. The data is securely streamed to [GCS](https://cloud.google.com/storage), encrypted, and retained for 90 days. The CustomersDot database is backed up daily with a retention period of 7 days.

All databases are continuously monitored to ensure successful backups, with alerts triggered for missing backups. Restore processes are validated through regular restoration from disk snapshots and replaying WAL segments.
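The missing-backup alerting described above amounts to a freshness check. As a minimal sketch (not GitLab's actual monitoring code; the two-hour alert slack is an assumption for illustration):

```python
from datetime import datetime, timedelta, timezone

# Policy: a full backup every 24 hours. The extra slack before an
# alert fires is an illustrative assumption, not a documented value.
BACKUP_INTERVAL = timedelta(hours=24)
ALERT_SLACK = timedelta(hours=2)

def backup_is_missing(last_backup_at: datetime) -> bool:
    """Return True when the newest backup is older than the expected
    interval plus slack, i.e. a missing-backup alert should fire."""
    now = datetime.now(timezone.utc)
    return now - last_backup_at > BACKUP_INTERVAL + ALERT_SLACK

# A backup taken 30 hours ago exceeds 24h + 2h and should alert.
last = datetime.now(timezone.utc) - timedelta(hours=30)
print(backup_is_missing(last))  # True
```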

#### Git Repositories

Repositories are backed up hourly using block-level disk snapshots. These snapshots are stored in multi-region object storage and retained for 14 days. All disks are monitored, with alerts triggered for missing snapshots.

Restore validation is conducted by randomly sampling disks and restoring recent snapshots.
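The sampling step can be sketched as follows; the helper name, disk names, and sample size are hypothetical, and the real validation is driven by GitLab's runbooks and tooling:

```python
import random

def pick_disks_for_validation(disks: list[str], sample_size: int = 3) -> list[str]:
    """Randomly choose disks whose most recent snapshot will be
    restored and verified (hypothetical helper, for illustration)."""
    return random.sample(disks, min(sample_size, len(disks)))

# Illustrative disk inventory; real names would come from the cloud API.
disks = [f"gitaly-disk-{i:02d}" for i in range(20)]
print(pick_disks_for_validation(disks))
```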

#### Object Storage

Data stored in Object Storage (GCS) benefits from Google's [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region bucket redundancy. To enhance data protection, [Object Versioning](https://cloud.google.com/storage/docs/object-versioning) and [Soft Delete](https://cloud.google.com/storage/docs/soft-delete) are enabled.

Automated restore validation is not required for Object Storage due to its inherent protections through versioning and soft delete.

### Disaster Recovery

For disaster recovery, backups are validated through periodic restoration exercises called "Game Days" to ensure compliance with recovery time objective (RTO) and recovery point objective (RPO) targets.
GitLab.com is deployed in the `us-east1` region across multiple GCP availability zones.
For short-term outages affecting a single zone within `us-east1`, unaffected zones will scale to restore service.
For the Gitaly service, recovery from backups will be necessary if data loss occurs.

Disaster recovery operations adhere to the [Disaster Recovery runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/disaster-recovery).
These procedures target specific services to allow parallelized recovery efforts.

Mock disaster recovery (DR) events are conducted quarterly to simulate incidents affecting one or more services.
These exercises validate DR processes and readiness for real incidents.

During these [Game Days](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/gameday.md), RTO and RPO targets are validated by [recording measurements for each procedure](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/recovery-measurements.md).

## Exceptions

Exceptions to this policy will be managed in accordance with the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).

## References

- [Information Security Policy](/handbook/security)
- [Records Retention & Disposal](/handbook/security/records-retention-deletion/)
- [Disaster Recovery runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/disaster-recovery)
- [GameDays](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/gameday.md)
+1 −116
@@ -3,119 +3,4 @@ title: "Database: Disaster Recovery"
controlled_document: true
---

## Purpose

This page contains an overview of the disaster recovery strategy we have
in place for the PostgreSQL database. In this context, a disaster means
losing the main database cluster or parts of it (a `DROP DATABASE`-type
incident).

This overview is not complete and will be extended.

We base our strategy on PostgreSQL's [Point-in-Time Recovery (PITR)](https://www.postgresql.org/docs/9.6/static/continuous-archiving.html) feature.

This means we're shipping daily snapshots and transaction logs (WAL) to
external storage (the archive). Given a snapshot, we can replay the WAL
until a certain point in time is reached (for example, right before the
disaster struck).

Currently, AWS S3 serves as a storage backend for the PITR archive.

## Scope

This handbook page applies to recovery of the GitLab PostgreSQL production database in a disaster scenario.

## Roles & Responsibilities

| Role | Responsibility|
| ---- | ------ |
| Infrastructure Team | Responsible for executing recovery of the production gitlab.com database in the event of a disaster |
| Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |

## Procedure

### Restore testing

A backup is only worth something if it can be successfully restored in a
certain amount of time. In order to monitor the state of backups and
measure the expected recovery time (`DB-DR-TTR`), we employ a daily
process to test the backups.

This process is implemented as a CI pipeline (see
[README.md](https://gitlab.com/gitlab-restore/postgres-gprd/blob/master/README.md)
for details). On a daily schedule, a fresh GCE database instance is
created, restores from the latest backup, and is configured as an
archive replica that recovers from the WAL archive (essentially
performing PITR). Once this completes, the restored database is
verified.
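The shape of such a scheduled pipeline might look roughly like this; the helper scripts are hypothetical placeholders, and the real definition lives in the `gitlab-restore/postgres-gprd` project:

```yaml
# Illustrative .gitlab-ci.yml sketch of a scheduled restore test.
restore-test:
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - ./create-restore-instance.sh   # hypothetical: provision a fresh GCE instance
    - ./perform-pitr.sh              # hypothetical: restore base backup, replay WAL
    - ./verify-database.sh           # hypothetical: sanity-check the restored data
```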

There is monitoring in place to detect problems with the restore
pipeline (currently using [deadmanssnitch.com](https://deadmanssnitch.com)).
We plan to monitor the time it takes to recover and other metrics soon.

### Disaster Recovery Replicas

The backup strategy above is a *cold backup*. In order to restore from a
cold backup, we need to retrieve the full backup from a cold storage
(via network) and perform PITR from it. This can take quite some time
considering the amount of data needed to be put on the network.

The current speed of restoring a cold backup from AWS S3 is about 380
GB per hour (net size) for retrieving the base backup. With a database
size of currently 2.1 TB, retrieving the base backup alone already
takes more than 5 hours. The PITR phase is generally slower still.
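The arithmetic on the stated figures checks out:

```python
# Stated figures: 2.1 TB database, ~380 GB/hour retrieval from S3.
db_size_gb = 2100
restore_rate_gb_per_hr = 380

base_backup_hours = db_size_gb / restore_rate_gb_per_hr
print(f"{base_backup_hours:.1f} hours")  # 5.5 hours, before WAL replay even starts
```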

We currently aim at a `DB-DR-TTR` of 8 hours to recover from a backup.
We're *not there yet*, so as an interim measure we introduce disaster
recovery replicas.

#### Delayed Replica

Another option is to have a replica in place that always lags a few
hours behind the production cluster. We call this a *delayed replica*: it
is a normal streaming replica, but delayed by a few hours. In case
disaster strikes, it can be used to quickly perform PITR from the WAL
archive. This is much faster than a full restore, because we don't have
to retrieve a full backup from S3. Additionally, with daily snapshots,
the latest snapshot is in the worst case 24 hours old (plus the time it
took to capture it). A delayed replica is constantly kept at a certain
offset from the production cluster and hence does not need to replay
too many hours' worth of data.

* Production host: `postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-delayed`
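In stock PostgreSQL, a delayed streaming replica of this kind is configured with `recovery_min_apply_delay`. A sketch for the 9.6-era `recovery.conf`, where the delay value and connection details are illustrative assumptions rather than production's actual settings:

```ini
# recovery.conf on the delayed replica (PostgreSQL 9.6-era syntax)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.internal user=replication'  # illustrative
recovery_min_apply_delay = '8h'  # illustrative delay, not the production setting
```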

#### Archive Replica

Another type of replica is an *archive replica*. Its sole purpose is to
continuously recover from the WAL archive and hence *test* the WAL
archive. This is necessary because PITR relies on a continuous sequence
of WAL that can be applied to a snapshot of the database (basebackup).
If that sequence is broken for whatever reason, PITR can only recover
up to that point and no further. We monitor the replication lag of the
archive replica. If it falls too far behind, there is likely a problem
with the WAL archive.

The restore testing pipeline also performs PITR from the WAL archive and
would thus also detect (some) problems with the archive. However, an
archive replica that stays close to the production cluster detects
problems with the archive much faster than a daily backup test does.
Also, the archive replica has to consume all WAL from the archive,
whereas a backup restore is likely to read only a portion of the
archive to recover to a certain point in time.

In that sense, the archive replica, the delayed replica, and the
restore testing overlap in functionality. Together they give us high
confidence in our cold backup and PITR recovery strategy.

* Production host: `postgres-dr-archive-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-archive`

## Exceptions

Exceptions to this procedure will be tracked as per the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).

## References

* Parent Policy: [Information Security Policy](/handbook/security/)
* [Controlled Document Procedure](/handbook/security/controlled-document-procedure/)
Moved to [Disaster Recovery Policies for GitLab Backups](/handbook/engineering/gitlab-com/policies/backup/#disaster-recovery)
+2 −37
@@ -129,44 +129,9 @@ For some incidents, we may figure out that the usage patterns that led to the is
1. The definition of abuse can be found on the [security abuse operations section of the handbook](/handbook/security/)
1. In the event of an incident affecting GitLab.com availability, the SRE team may take action immediately to keep the system available. However, the team must also immediately involve our security abuse team. A new [security on call rotation](/handbook/security/security-operations/sirt/engaging-security-on-call/) has been established in PagerDuty: a Security Responder rotation can be alerted along with a Security Manager rotation.

## Backups
## Backup and Restore

### Purpose

This section is part of a [controlled document](/handbook/security/controlled-document-procedure/) covering our controls for backups. It covers BCD-11 in [the controls](/handbook/security/security-assurance/security-compliance/guidance/business-continuity-and-disaster-recovery/).

### Scope

Production database backups

### Roles & Responsibilities

| Role  | Responsibility |
|-----------|-----------|
| Infrastructure Team | Responsible for configuration and management |
| Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |

### Procedure

Backups of our production databases are taken every 24 hours, with continuous incremental data (at 60-second intervals) streamed into [GCS](https://cloud.google.com/storage). These backups are encrypted and follow this lifecycle:

- Initial 7 days in [Multi-regional](https://cloud.google.com/storage/docs/storage-classes#standard) storage class.
- After 7 days migrated to [Coldline](https://cloud.google.com/storage/docs/storage-classes#coldline) storage class.
- After 90 days, backups are deleted.
- Snapshots of non-Patroni-managed database (e.g. PostgreSQL DR replicas) and non-database (e.g. Gitaly, Redis, Prometheus) data filesystems are taken every hour and kept for at least 7 days.
- Snapshots of Patroni-managed databases (a designated replica, in fact) are taken every 6 hours and kept for 7 days.
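The first three lifecycle steps map onto a GCS bucket lifecycle configuration along these lines (a sketch of the JSON shape, not the literal production policy):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 7}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
```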

Data stored in Object Storage (GCS), such as artifacts, the container registry, and others, has no additional backups, relying on the [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region buckets.

For details see the runbooks, particularly [GCP snapshots](https://gitlab.com/gitlab-com/runbooks/blob/master/docs/uncategorized/gcp-snapshots.md) and [Database backups using WAL-E/WAL-G (encrypted)](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md).

### Exceptions

Exceptions to this backup policy will be tracked in the [compliance issue tracker](https://gitlab.com/gitlab-com/gl-security/security-assurance/team-commercial-compliance/compliance/-/issues/).

### References

- Parent Policy: [Information Security Policy](/handbook/security/)
See policies for [Backup and Restore](/handbook/engineering/gitlab-com/policies/backup).

## Patching

+1 −1

Contains only whitespace changes.
