[Meta] Disaster recovery for everything that is not the database
Given our current experience, we need to work on ironing out the following cases to be in full control of the infrastructure:
1. Git files gone: having snapshots is one thing, having a backup that we can actually recover from is another. (https://gitlab.com/gitlab-com/infrastructure/issues/1606)
2. Shared files gone: same case as the previous one - we need a way of recovering from this. (https://gitlab.com/gitlab-com/infrastructure/issues/1606)
3. Attacker in the system, need to reset all credentials: we should move all our credential handling into Vault. (https://gitlab.com/gitlab-com/infrastructure/issues/1212)
4. Package server hacked: recover and confirm that no package has been modified. (https://gitlab.com/gitlab-com/infrastructure/issues/1252)
5. Pushing a package that causes data loss: detection and a measured response time. (https://gitlab.com/gitlab-com/infrastructure/issues/1264)
6. Multiple forms of backups for the database, including disk snapshotting for quick recovery. (https://gitlab.com/gitlab-com/infrastructure/issues/1152, https://gitlab.com/gitlab-com/infrastructure/issues/1251)
7. Staging access should be available only to people with production access. (gitlab-com/www-gitlab-com!5123 (merged), https://gitlab.com/gitlab-com/infrastructure/issues/1231)
8. Remove production access from release managers.
9. All production engineers use a YubiKey for enhanced security. (https://gitlab.com/gitlab-com/infrastructure/issues/1250)
10. Secondary servers (version, etc.) disaster recovery plan. (https://gitlab.com/gitlab-com/infrastructure/issues/1239)
11. Production readiness checklist, including disaster recovery - nothing goes into production if we don't have a backup plan. (https://gitlab.com/gitlab-com/infrastructure/issues/1240)
12. Assign a data durability owner who will be responsible for making sure that things work. (gitlab-com/www-gitlab-com!4950 (merged))
13. Massive DDoS attack (not disaster recovery, but availability). (https://gitlab.com/gitlab-com/infrastructure/issues/1284)
14. Azure Region Down (https://gitlab.com/gitlab-com/infrastructure/issues/1285)
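For item 4, one possible way to confirm that no package has been modified is to compare the files on the package server against a checksum manifest recorded at release time and stored somewhere the package server cannot write to. The following is only a minimal sketch under that assumption: the function names (`build_manifest`, `find_tampered`) and the manifest layout are hypothetical and are not taken from the linked issue.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to package_dir) to its SHA-256 digest."""
    return {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in sorted(package_dir.rglob("*"))
        if p.is_file()
    }


def find_tampered(package_dir: Path, trusted: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current files against a trusted manifest.

    Assumption: `trusted` was produced by build_manifest() at release time
    and kept out of reach of the (possibly compromised) package server.
    """
    current = build_manifest(package_dir)
    return {
        "modified": sorted(n for n in trusted if n in current and current[n] != trusted[n]),
        "missing": sorted(n for n in trusted if n not in current),
        "unexpected": sorted(n for n in current if n not in trusted),
    }
```

A check like this only helps if the trusted manifest itself is protected; an empty result in all three lists means the files match what was recorded, not that the original release was safe.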
Edited by Pablo Carranza [GitLab]