Create emergency runbook for Data Team data ingestion

Summary

We've had 4 incidents over the last 12 months with the ZFS-cloned database the Data Team uses to perform data extraction for the data warehouse. Three of these were easily recoverable, but the last one resulted in a multi-day outage.

Related Incident(s)

  • Originating issue(s): gitlab-com/gl-infra/production#6733*
  • Also: gitlab-com/gl-infra/reliability#15547, gitlab-com/gl-infra/reliability#15574

Desired Outcome/Acceptance Criteria

Re-provisioning a ZFS-based replica can take hours, and the duration is highly sensitive to the volume of WAL generated in the meantime. Depending on the problem, the lag between the data in the data warehouse and in production may stretch too long and have significantly negative effects on business data.

To provide a bridge that minimizes this lag, we have the ability to build a temporary database from a GCP snapshot, as was done in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15547. We should turn that process into a runbook we can use in case of emergency.
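The snapshot-based rebuild could be sketched roughly as follows. This is a hedged outline only, not the procedure from the referenced issue: all resource names, zones, and machine types are placeholders, and the real runbook would need the actual snapshot name, disk layout, and PostgreSQL configuration.

```shell
# Sketch: build a temporary Postgres host from a GCP disk snapshot.
# All names below (pg-tmp-data, pg-tmp-replica, SNAPSHOT_NAME, the zone
# and machine type) are hypothetical placeholders.

# 1. Create a new persistent disk from the most recent data-disk snapshot.
gcloud compute disks create pg-tmp-data \
  --source-snapshot=SNAPSHOT_NAME \
  --zone=us-east1-c \
  --type=pd-ssd

# 2. Create a temporary instance to host the database.
gcloud compute instances create pg-tmp-replica \
  --zone=us-east1-c \
  --machine-type=n1-highmem-16

# 3. Attach the restored data disk to the instance.
gcloud compute instances attach-disk pg-tmp-replica \
  --disk=pg-tmp-data \
  --zone=us-east1-c

# 4. On the instance: mount the disk, point PostgreSQL's data_directory
#    at it, and start the server in recovery so it replays WAL from the
#    archive until it is close enough to production to resume extraction.
```

A runbook version of this would also need the verification steps (replication lag checks, extraction smoke tests) before the Data Team switches over.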

Also

After the creation of a temporary database from a GCP snapshot, pg_last_xact_replay_timestamp() would sometimes not return a value, a behavior we had never observed with ZFS clones. We need to investigate why this happens (otherwise the copy is not really usable).
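For the investigation, one detail from the PostgreSQL documentation may be relevant: pg_last_xact_replay_timestamp() returns NULL if no transactions have been replayed during the current recovery, so a freshly restored snapshot that has not yet replayed any WAL would legitimately show nothing. A quick check along these lines (connection details are placeholders) could help distinguish "no WAL replayed yet" from a genuinely broken copy:

```shell
# Compare the replay timestamp with the replay LSN on the temporary copy.
# If the LSN advances while the timestamp stays NULL, something is off;
# if both are NULL/static, recovery may simply not have replayed anything yet.
psql -h pg-tmp-replica -U postgres -c \
  "SELECT pg_is_in_recovery(),
          pg_last_wal_replay_lsn(),
          pg_last_xact_replay_timestamp();"
```

Capturing this output during the next rebuild would give the investigation a concrete starting point.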

Associated Services

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose out of
  • Give context for what problem this corrective action is trying to prevent from re-occurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'priority::4')
Edited by Gerardo Lopez-Fernandez