GitLab Silent Mode for Backup / Disaster Recovery Testing
Introduction
In order to ensure that a backup has worked correctly, it's important to perform routine recovery testing. Likewise, when using GitLab Geo, for disaster recovery scenarios, a full recovery test helps ensure that procedures are working as expected.
However, when performing recovery testing on a GitLab backup, one of the issues faced is that the new "recovery test" environment may believe that it's production, and send out emails, webhooks, push mirrors, or communicate with connected Kubernetes servers (deprecated). This adds risk and complexity to recovery procedures. The usual approach to addressing this is to use firewall rules to block outbound traffic, but this has many problems:
- It's easy to get firewall rules wrong or to miss a required rule, leading to a "leak". This has led to numerous incidents on GitLab.com in which customers received emails and other unintended communications.
- Outbound services may use Cloud Service APIs, which can be difficult to distinguish using firewall rules. Blocking the entire cloud service API may lead to unexpected results which may impact the recovery test.
- Features such as SAML/LDAP/OAuth authentication may still require outbound connections, so they either need firewall exceptions or must be disabled, further complicating the testing.
- Dogfooding: a built-in feature benefits all customers, including standalone Omnibus and GET installations, GitLab Dedicated, and GitLab.com, whereas firewall rules are specific to a single instance.
- Some features that perform outbound communications may fail and keep retrying, possibly interfering with the recovery test.
Proposal
- Add a silent mode toggle (default off) in the GitLab application settings. When this feature is enabled, no outbound communications will emanate from the GitLab instance, allowing recovery testing.
- Any feature that performs outbound communications should check the application setting before performing the task and, when silent mode is enabled, return silently and, where possible, successfully. This includes (but is not limited to):
  - Outbound emails
  - Outbound webhooks
  - Push mirroring and pull mirroring
  - Deprecated Kubernetes connections (not needed for KAS)
  - GitLab alerting notifications
  - Slack webhooks and other services
Since these checks can be added at a fairly low level, it's likely that a small set of gateways within the codebase could provide total coverage.
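The gateway idea above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not GitLab code; `ApplicationSettings`, `silent_mode_enabled`, `OutboundGateway`, and `DeliveryResult` are hypothetical names chosen for this example. It shows one choke point checking the setting and, when silent mode is on, returning a successful result without sending anything, so that callers do not retry.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ApplicationSettings:
    """Hypothetical stand-in for the instance-wide application settings."""
    silent_mode_enabled: bool = False  # default off, as in the proposal


@dataclass
class DeliveryResult:
    success: bool     # what the caller sees; True avoids retry loops
    suppressed: bool  # True when silent mode swallowed the delivery


@dataclass
class OutboundGateway:
    """Hypothetical single choke point for one kind of outbound traffic."""
    settings: ApplicationSettings
    transport: Callable[[str, str], None]  # performs the real network call
    suppressed_log: List[str] = field(default_factory=list)

    def deliver(self, destination: str, payload: str) -> DeliveryResult:
        if self.settings.silent_mode_enabled:
            # Record what would have been sent, send nothing, report success.
            self.suppressed_log.append(destination)
            return DeliveryResult(success=True, suppressed=True)
        self.transport(destination, payload)
        return DeliveryResult(success=True, suppressed=False)


if __name__ == "__main__":
    sent = []  # stands in for the network in this sketch
    settings = ApplicationSettings()
    webhooks = OutboundGateway(settings, lambda dest, body: sent.append((dest, body)))

    webhooks.deliver("https://example.test/hook", "{}")  # delivered normally
    settings.silent_mode_enabled = True
    result = webhooks.deliver("https://example.test/hook", "{}")  # suppressed
    print(len(sent), result.suppressed)
```

Because the check lives in the gateway rather than in each feature, emails, webhooks, and mirroring would each need only one such guard, assuming every outbound path for that feature actually goes through its gateway.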
Benefits
Having this toggle in place would allow for easier, less complex, and more predictable automated recovery testing, something that is being discussed for GitLab Dedicated. It would also benefit GitLab.com. Additionally, it would make it much easier for self-managed instances to practice backup recovery and Geo DR failover.
References
- (Limited access) description of how GitLab Dedicated could use this feature to perform routine automated DR testing: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/merge_requests/237#note_1058087852
cc @awthomas @joshlambert @glopezfernandez @marin @o-lluch @mkozono @juan-silva @jarv