[META] Restore appreciation day further improvements
With database restores finally automated, I would like to discuss the direction the restore procedures will take from here. Currently they're in the "make it work" stage, so to speak, and there's still a lot to be done to reach the "make it right" and "make it fast" stages. Here's what I see as left to do:
- MR1, MR2, MR3, MR4: Simplify secondaries restore to use manual jobs and trigger the `verify` step via API (this will later be required for the production database because of its size)
- Create a separate system user on GitLab.com to own the pipelines, and document how it was done
- Switch the rest of the restoration pipelines to be owned by this non-personalized account and use an async API call to start manual tasks:
- Formalize the threat model for the `gitlab-restore` project at GCP (MR). Key points:
  - clearly define assets (for example, I'd say the keys themselves are not assets, but the data is)
  - define scope (keep it simple and contained to the project only)
  - define countermeasures and use them as a foundation for the next steps
- Document the threat model with countermeasures, and have the security team take a nice, long look at it
- According to the above, set up the missing pieces of the puzzle:
  - repo settings check and monitoring (MR)
  - project settings check and monitoring (answer: are we compliant with the active requirements?)
  - add the necessary service accounts; document their permissions, the credential rotation runbook, and the schedule
  - implement KMS rotation and test that the automated procedures are compatible with it
- Formalize the threat model for the `gitlab-backup-data` project at GCP (MR). Key points:
  - define assets and scope
  - aim for a "write from everywhere, read from `gitlab-restore` only" model of operation
  - aim for having only GPG-encrypted data inside, without access to the GPG decryption keys
- Create the `gitlab-backup-data` project at GCP and link it to billing
- According to the above, set up the missing parts for `gitlab-backup-data`:
  - documentation on how it was created
  - project settings checks (MR)
  - add the necessary service accounts and document their permissions
- With `gitlab-backup-data` live, reroute WAL-E chunks from S3 to GCP
  - With WAL-E chunks on GCP, enable daily production database restore
  - Meet the OKR of sub-hour production database restore
  - Implement GPG key rotation procedures
  - Support different GPG keys in different projects
- Decouple the restore pipeline from the data warehouse input pipeline
  - Enable automatic instance cleanup to reduce costs
- Implement automatic verification procedures for each restored service
- Create a Grafana dashboard with restore status and time taken to restore, and set up alerts on failures
- Formalize DR tests, if needed
- Update the handbook restore section to reflect the above changes
- Continuously expand the whole thing to include backups of other (non-database) services
- Continuously update documentation
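
As a rough illustration of the "async API call to start manual tasks" item, here is a minimal sketch. It assumes the GitLab REST endpoint `POST /projects/:id/jobs/:job_id/play` for starting a manual job and a token passed in the `PRIVATE-TOKEN` header; the function names, project ID, and job ID are hypothetical and not taken from the actual pipelines.

```python
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"


def build_play_request(project_id: int, job_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that starts a manual job."""
    url = f"{GITLAB_API}/projects/{project_id}/jobs/{job_id}/play"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"PRIVATE-TOKEN": token},
    )


def play_manual_job(project_id: int, job_id: int, token: str) -> int:
    """Send the request and return the HTTP status code.

    The call returns once GitLab has accepted the request, not when the
    job finishes, so the caller does not block on the restore itself.
    """
    request = build_play_request(project_id, job_id, token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as `play_manual_job(123, 456, token)` would use the bot user's access token; the IDs here are placeholders.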
Things that need rotation procedures, tracked here in one place:
- Access token for the gitlab-restore-bot@gitlab.com user, used in two places:
  - as a secret CI/CD variable (can I access the API from within the job?)
  - as an encrypted blob in gitlab-restore GCS
- KMS keys at gitlab-restore@GCP
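
One way to keep track of these would be a small inventory that flags anything past its rotation interval. The sketch below is illustrative only: the entry names mirror the list above, while the dates and the 90-day intervals are made-up placeholders, not an agreed schedule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Secret:
    name: str
    last_rotated: date
    max_age: timedelta


def overdue(secrets, today):
    """Return the names of secrets whose age exceeds their allowed maximum."""
    return [s.name for s in secrets if today - s.last_rotated > s.max_age]


# Hypothetical inventory: dates and intervals are placeholders.
INVENTORY = [
    Secret("gitlab-restore-bot access token", date(2018, 1, 10), timedelta(days=90)),
    Secret("gitlab-restore KMS keys", date(2018, 3, 1), timedelta(days=90)),
]
```

Running `overdue(INVENTORY, date.today())` from a scheduled job would give a list that could feed an alert or a reminder issue.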
And probably a lot more. In my head, this is already becoming a boring, yet important project, as opposed to the assorted set of tasks we have now -- basically owning the backup/restore procedure of every service worth restoring, and we need to formalize it. Currently we have only https://about.gitlab.com/handbook/infrastructure/production/#backups, and it's already obsolete.
Thoughts? Suggestions? Please dump them here.
/cc @gl-infra @edjdev