Provide developers with access to production-like data
### Executive Summary

We are evaluating Database Labs EE (DBLEE) from Postgres.ai as a long-term solution for providing production-like data for testing earlier in the SDLC (shift-left). This is a typical build vs. buy decision. We believe that the features available in DBLEE offer an immediate option for manually testing migrations, as well as long-term solutions for automating tests and keeping our Staging environment more similar to Production.

#### Goals

##### Short Term

- Expand some usage of DBLEE to all developers
  - GUI access (as opposed to joe-bot)
  - User spaces
  - Session isolation
  - Session History
  - Plan and query visualization
  - Thin cloning only available to those who already have production access
- Codify and require that database maintainers run data migrations against thin clones
  - Catch migration problems prior to staging or production
  - Reduce support and escalation costs

##### Long Term

- Automate migration testing in CI prior to staging deployment
  - Remove the need for manual testing by maintainers
- Use anonymization and masking features of DBLEE to provide production-like data for staging
  - Pending security and compliance review and approval

### Overview

Recently there have been a number of requests from the development team for access to production-like data for testing prior to deployment. There are a few reasons why this would be useful for developers. Right now it is too expensive to fail in the staging environment, since failures cause escalation events, and a number of database migrations have failed in staging recently. Moreover, staging is currently only about one-tenth the size of production, so some migrations that pass in staging can still fail in production. If developers had a way to test their migrations prior to deployment, this would reduce the number of failures in staging and production.

Additionally, the memory team is researching how and where to reduce the number of cached SQL calls. It is difficult to replicate the production environment well enough to validate assumptions and to test and measure the improvements from these changes. Again, having access to production-like data would help.

### Priorities

- Improve developers' ability to test migrations with production-like data prior to deploying to staging or production
- Give developers safe access to production-like data for developer testing and validation

### Plan

We can address the goals set in this epic in three phases:

#### Phase 1 - Manual testing against production data

Available only to database maintainers (users with permission to access production data)

[1. Setup single user server for testing migrations](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/103) --> [2. Provide access to server for testing migrations to all maintainers](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/106)

Once this phase is completed, we'll be able to manually test all database migrations, including background migrations, against a fresh clone from production. It can be accompanied by an update to our database reviewing guidelines, so that all database migrations that can cause incidents (data migrations, background migrations) are manually tested against a production clone before an MR is approved.

#### Phase 2 - Use anonymization to extend migration testing to more environments and team members

[1. Process to identify and anonymize PII information](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/95) --> [2. Anonymized production clones](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/107) --> [3. Anonymized staging](https://gitlab.com/gitlab-org/database-team/team-tasks/-/issues/96)

Step 2 above, together with the results from Phase 1, could be used to give all GitLab team members access to anonymized production data.

Step 3 could be used so that all migrations and tests that run in staging (or other environments) run against a fresh anonymized clone of production data.

#### Phase 3 - Automate testing

In this phase we automate migration testing in CI prior to staging deployment by creating a private CI Runner that tests migrations automatically and safely, without triggering any unintended consequences. This is our long-term goal, and more research is required before we have a concrete implementation plan. It could be combined with the anonymized production clones from Phase 2 so that all migration output is available in CI job logs. Alternatively, it could run against a non-anonymized production clone with no job logs provided and only a success/fail message, so that users with production data access (database maintainers) can manually investigate only in case of failures.

-----

Relates to: Reduce failed DB migrations that are caught in production - https://gitlab.com/gitlab-org/gitlab/-/issues/247537#note_412108199

cc @gitlab-org/database-team @nikolays