GitLab.com
GitLab Infrastructure Team
Production Engineering
Issues
#3222

[META] Restore appreciation day further improvements

With database restores finally automated, I would like to discuss the direction the restore procedures will take from here. Currently, it's in the "make it work" stage, so to speak, and there's still a lot to be done to reach the "make it right" and "make it fast" stages. Here's what I see as left to do:

MR1 MR2 MR3 MR4 Simplify secondaries restore to use manual jobs and trigger the verify step via API (this will later be required for the production database due to its size)
Create separate system user on GitLab.com to own the pipelines and document how it was done
Switch the rest of the restoration pipelines to be owned by this non-personal account and use an async API call to start manual tasks:
- version.gitlab.com MR
- customers.gitlab.com MR
- db3.cluster.gitlab.com MR
Formalize threat model for gitlab-restore project at GCP. Key points: MR
- clearly define assets (like, I'd say keys themselves are not assets, but data is)
- define scope (keep it simple and contained to the project only)
- define countermeasures and use as a foundation for next steps
Document the threat model with countermeasures, have security team take a nice, long look at it
According to the above, set up missing pieces of the puzzle:
- repo settings check and monitoring MR
- project settings check and monitoring (answer: are we compliant with the active requirements)
- add necessary service accounts, document their permissions and credential rotation runbook and schedule
- implement KMS rotation and test that automated procedures are compatible with it
Formalize threat model for gitlab-backup-data project at GCP. Key points: MR
- define assets and scope
- aim for "write from everywhere, read from gitlab-restore only" model of operation
- aim for having only GPG encrypted data inside, w/o access to the GPG decryption keys
Create the gitlab-backup-data project at GCP, link it to billing.
According to the above, set up missing parts for gitlab-backup-data:
- Documentation on how it was created.
- project settings checks MR
- add necessary service accounts and document their permissions
With gitlab-backup-data live, reroute WAL-E chunks from S3 to GCP
With WAL-E chunks on GCP, enable daily production database restore
Meet OKR of sub-hour production database restore
Implement GPG key rotation procedures
- Support different GPG keys in different projects
Decouple the restore pipeline from data warehouse input pipeline
- Enable automatic instance cleanup to reduce costs
Implement automatic verification procedures for each restored service
Create Grafana dashboard with restore status and time taken to restore, set up alerts on failures
Formalize DR tests, if needed
Update handbook restore section reflecting the above changes
Continuously expand the whole thing to include other (non-database) services backups
Continuously update documentation
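
To illustrate the "use async API call to start manual tasks" item, here is a minimal sketch of what the call could look like, assuming the GitLab REST API v4 endpoint for playing a manual job (`POST /projects/:id/jobs/:job_id/play`) and a token belonging to the non-personal account. The project ID, job ID, token value, and the `build_play_request` helper name are all hypothetical; the sketch only builds the request and does not send it.

```python
import urllib.request

# Assumption: GitLab.com REST API v4 base URL.
GITLAB_API = "https://gitlab.com/api/v4"


def build_play_request(project_id: int, job_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that starts a manual CI job.

    project_id, job_id and token are placeholders supplied by the caller;
    none of them come from the real restore pipelines.
    """
    url = f"{GITLAB_API}/projects/{project_id}/jobs/{job_id}/play"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the endpoint needs no payload here
        method="POST",
        headers={"PRIVATE-TOKEN": token},  # token of the bot account
    )


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    req = build_play_request(123, 456, "dummy-token")
    print(req.get_method(), req.full_url)
    # To actually trigger the job: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the token handling in one place and makes the call easy to check without touching a live pipeline.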

List of things that need rotation procedures, tracked here in one place:

Access token for gitlab-restore-bot@gitlab.com user, used in two places:
- as a secret CI/CD variable (can I access api from within the job?)
- as an encrypted blob in gitlab-restore GCS
KMS keys at gitlab-restore@GCP
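
One possible way to keep this list trackable is a small inventory that records where each credential lives and when it was last rotated. The sketch below is only an illustration: the 90-day interval, the dates, and the `Credential` / `overdue` names are assumptions, not an agreed policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    name: str
    locations: tuple        # where the secret is stored or used
    last_rotated: date      # hypothetical dates in the example below
    max_age: timedelta      # assumed rotation interval, not a real policy

    def due_on(self) -> date:
        """Date by which the credential should be rotated again."""
        return self.last_rotated + self.max_age


def overdue(creds, today: date) -> list:
    """Return the names of credentials whose rotation date has passed."""
    return [c.name for c in creds if c.due_on() <= today]


# Inventory mirroring the list above; dates and intervals are made up.
INVENTORY = [
    Credential(
        name="gitlab-restore-bot access token",
        locations=("secret CI/CD variable", "encrypted blob in gitlab-restore GCS"),
        last_rotated=date(2017, 9, 1),
        max_age=timedelta(days=90),
    ),
    Credential(
        name="KMS keys at gitlab-restore",
        locations=("gitlab-restore@GCP",),
        last_rotated=date(2017, 12, 1),
        max_age=timedelta(days=90),
    ),
]

if __name__ == "__main__":
    for name in overdue(INVENTORY, date.today()):
        print(f"rotation overdue: {name}")
```

A check like this could run as a scheduled job and feed the alerts mentioned in the plan, but that wiring is left out here.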

And probably a lot more. In my head, this is already becoming a boring, yet important project, as opposed to the assorted set of tasks we have now. It basically means owning the backup/restore procedure of every service worth restoring, and we need to formalize it. Currently we have only https://about.gitlab.com/handbook/infrastructure/production/#backups, and it's already obsolete.

Thoughts? Suggestions? Please dump them here.

/cc @gl-infra @edjdev

Edited Dec 16, 2017 by Ilya Frolov
