Decomposition: Rollout decomposed CI database on GitLab.com (staging and production)
# Rollout change request https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7111#change-details ## Overview The decision was made to pursue [Decomposition](https://gitlab.com/groups/gitlab-org/-/epics/5883#design-overview) for database sharding. The database sharding will impact many areas of GitLab product and operations. This epic describes our high-level plan to rollout a decomposed database to staging and then production. It also highlights required support that is needed from the Delivery team. We expect that the general flow of rolling out changes is: ```mermaid graph LR A[Sandbox] --> B[Staging] --> C[Production] ``` ## 2022 Timeline Based on https://gitlab.com/groups/gitlab-org/-/epics/6160#note_805707429 ```mermaid gantt dateFormat YYYY-MM-DD title 50% confidence timeline section Work Infra work for new database :active , infradb, 2022-01-01, 2022-03-15 Remove foreign keys :active , fks, 2022-01-01, 2022-02-10 Ops phase 4 work :active , opsphase4, 2022-01-01, 2022-01-31 Ops phase 6 work :active , opsphase6, 2022-01-01, 2022-01-31 Local dev multi-DB default : multilocaldev, after fks, 28d .gitlab-ci.yml multi-DB default : multidbci, after fks, 28d DB migration tooling for multi-DB : dbmigrationtooling, 2022-01-01, 52d Ship phase 3 : phase3, 2022-02-01, 2022-03-18 Ship phase 4 : phase4, after phase3, 2022-04-26 Ship phase 5 : phase5, after phase4, 21d Phase 6 : phase6, after phase4 multilocaldev multidbci opsphase6, 21d Ship phase 7 : phase7, after phase6 dbmigrationtooling, 28d axisFormat %b-%d ``` ## Feature flags | Feature Flags | Description | Link | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Phase 5 | Phase 6 | Phase 7 | |----------------------------------------------|------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|---------|---------|---------------|---------------|---------------|---------------|---------------| | GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci | Enables re-use of primary connection by the CI | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/72698 | n/a | n/a | set to 'main' | set to 'ci' | set to 'ci' | set to 'ci' | set to 'ci' | | use_model_load_balancing | Enables usage of CI replicas for read-only queries | https://gitlab.com/gitlab-org/gitlab/-/issues/344797 https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73631 | n/a | n/a | 0.01% | 0.01% | 100% | 100% | 100% | | query_analyzer_gitlab_schema_metrics | Controls percentage of queries parsed for cross-join detection and tracking via prometheus metrics in production | https://gitlab.com/gitlab-org/gitlab/-/issues/345034 https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73839 | n/a | n/a | 0.01% | 100% | 100% | 100% | 100% | | ENABLE_CROSS_DATABASE_MODIFICATION_DETECTION | Enables exception tracking of cross database modification | https://gitlab.com/gitlab-org/gitlab/-/issues/344620 https://gitlab.com/gitlab-org/gitlab/-/merge_requests/74177 | n/a | n/a | set to 'true' | set to 'true' | set to 'true' | set to 'true' | set to 'true' | | detect_cross_database_modification | Controls percentage of queries sampled for cross-database modification | https://gitlab.com/gitlab-org/gitlab/-/issues/344620 https://gitlab.com/gitlab-org/gitlab/-/merge_requests/74177 | n/a | n/a | 0.01% | 100% | 100% | 100% | 100% | - names in capital letters indicate **environment variables** - names in lower-case letters indicate feature flags in development environment: https://docs.gitlab.com/ee/development/feature_flags/#:~:text=development%20feature%20flags%20are%20short,Feature%20Flag%20Roll%20Out%20template. - edit source and copy the whole table in place of the above to make changes (accessible for GitLab Inc): https://docs.google.com/spreadsheets/d/1MU58XnnfXwVO1VXs5mrMMGE54xMBWNDfHqXEKYqYbEs/edit?usp=sharing ## Roll out plan This is the high-level plan to rollout into staging, and production. In contrast to production, we expect that we will be rollback and forward several times. We expect that things may surface during these phases. **Diagram notes** * We've simplified some components for the sake of clarity. Detailed steps for each phase should be captured in the issues. * Reads are also served by the main/CI databased; we've omitted that detail. * Number of read-only replicas is not accurate. * Source https://drive.google.com/file/d/1bBSmluc5eFubIqmy15LipTuhUisylZSs/view?usp=sharing ### Phase 1: Deploy a Patroni cluster | ENV | Status | Date | | ------ | ------ | -----| | STG | **DEPLOYED** | 2022-01-31 | | [PRD](https://gitlab.com/gitlab-org/gitlab/-/issues/351567) | **DEPLOYED** | 2022-03-18 | **Phase duration**: Days to weeks **Summary:** Infrastructure will provision a Patroni cluster. This cluster is not yet required to be connected to the main cluster. We only need to establish that we can deploy (and re-deploy) this cluster and that all required monitoring and logging capabilities are in place. Details in https://gitlab.com/gitlab-org/gitlab/-/issues/341821 ### Phase 2: Configure Patroni standby cluster. | ENV | Status | Date | | ------ | ------ | -----| | STG | **DEPLOYED** | 2022-01-31 | | [PRD](https://gitlab.com/gitlab-org/gitlab/-/issues/351572) | **DEPLOYED** | 2022-03-18 | **Phase duration**: Days to weeks **Summary:** Infrastructure will configure the unused Patroni cluster as a standby cluster (see https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#standby-cluster). The entire Main database will be replicated via cascading streaming replication. Details in https://gitlab.com/gitlab-org/gitlab/-/issues/351571 and https://gitlab.com/gitlab-org/gitlab/-/issues/351572 ### Phase 3: Serve CI reads from CI standby cluster | ENV | Status | Date | | ------ | ------ | -----| | STG | **DEPLOYED** | 2022-02-04 | | [PRD](https://gitlab.com/gitlab-org/gitlab/-/issues/351568) | **DEPLOYED** | 2022-03-24 | **Phase duration**: Days to weeks **Summary:** In this phase read traffic _only_ for CI data will be served from the CI database. We require a way to share a primary write connection while using a separate read replica. **Infrastructure requirements:** * Enable reads from CI replicas on staging **Application requirements:** * Ability to share primary write connection while separate read replicas configured https://gitlab.com/gitlab-org/gitlab/-/issues/341451 * Safe rollout and rollback/feature flag option for enabling reads from CI read replicas https://gitlab.com/gitlab-org/gitlab/-/issues/342487 **Schematic overview:** ![phase_3](/uploads/b56523103129a73c18d9fb66ea166a97/phase_3.png) ### Phase 4: Separate write connections for CI and Main (still going to the same primary host) | ENV | Status | Date | | ------ | ------ | -----| | [STG](https://gitlab.com/gitlab-org/gitlab/-/issues/351569) | **DEPLOYED** | 2022-03-31 | | [PRD](https://gitlab.com/gitlab-org/gitlab/-/issues/351570) | **DEPLOYED** | 2022-04-28 | **Phase duration**: Days to weeks **Summary:** Here we will use two different connections for writes. This means that the application must be able to work with different databases. For the application the CI database happens to be located at the Main database. **Infrastructure requirements:** * Point CI PGBouncer at Main primary and configure CI writes to go through this new PGBouncer **Application requirements:** * Solve all problems caused by 2 different primary connections (eg. cross-DB modifications [`CrossDatabaseModificationAcrossUnsupportedTablesError`](https://gitlab.com/gitlab-org/gitlab/-/blob/a1b4d1feb3901ca1f2ad250f7442a75133b16233/spec/support/database/prevent_cross_database_modification.rb#L5), maybe others related to transactions in particular) . This means [empty cross-database-modification-allowlist.yml](https://gitlab.com/gitlab-org/gitlab/blob/master/spec/support/database/cross-database-modification-allowlist.yml#L1) and [no `allow_cross_database_modification_within_transaction`](https://gitlab.com/search?search=allow_cross_database_modification_within_transaction&group_id=9970&project_id=278964&scope=&search_code=true&snippets=false&repository_ref=master&nav_source=navbar) * All CI tables are marked as using the CI connection https://gitlab.com/gitlab-org/gitlab/-/issues/343747 * Foreign keys that span multiple databases will need to be removed and a suitable alternative used (likely many will use ["loose foreign keys"](https://gitlab.com/gitlab-org/gitlab/-/issues/338535)). This is a requirement due to https://gitlab.com/gitlab-org/gitlab/-/issues/347377 **Schematic overview:** ![phase_4](/uploads/3ac1347152943f90863c9e95b8ebe8b8/phase_4.png) ### :red_circle: Phase 5: Staging dry-run and finishing the migration plan | ENV | Status | Date | | ------ | ------ | -----| | [STG](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6576) | STARTED | | | PRD | N/A | | **Phase duration**: Minutes **Summary:** This is the most critical part of the process. We are risking data loss and/or a split brain situation and will need to dry run this several times. During this phase we'll test and iterate on the actual failover procedure in local, preprod and staging environments. The output of this stage is completing the migration template in https://gitlab.com/gitlab-com/gl-infra/db-migration/-/merge_requests/181 . **Infrastructure requirements:** * Ansible scripts for downtime and failover proces completed * Dry-run on staging for the migration plan (ie. blocking writes for a period and testing that we catch up replicating CI tables) * Implement and test a rollback plan on staging https://gitlab.com/gitlab-org/gitlab/-/issues/361759 **Application requirements:** ### Phase 6: Validation of metrics and additional logging | ENV | Status | Date | | ------ | ------ | -----| | STG | N/A | | | PRD | STARTED | | **Summary:** At this point we want to be absolutely sure that the application will correctly work with 2 databases. We want to make sure that all available metrics support all the assumptions we have about how the application and the database interact. Some example validation we may wish to do: 1. [x] Analyze query logs or other monitoring (Prometheus) data to confirm that the correct connection/PGBouncer instance is being used for the correct queries (ie. main connection never queries CI, CI connection never queries main) 2. [ ] Ship an extra query analyzer to production that will log queries to the wrong database so we're sure this isn't happening https://gitlab.com/gitlab-org/gitlab/-/issues/347402 3. [x] Analysis of [`gitlab_database_decomposition_gitlab_schemas_used` in Prometheus](https://thanos-query.ops.gitlab.net/graph?g0.expr=gitlab_database_decomposition_gitlab_schemas_used&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) 1. Confirm there are no queries with multiple `gitlab_schemas` 1. Confirm that all `db_config_name=main` or `db_config_name=main_replica` queries correspond to `gitlab_schemas=main` only 1. Confirm that all `db_config_name=ci` or `db_config_name=ci_replica` queries correspond to `gitlab_schemas=ci` only **Infrastructure requirements:** * None **Application requirements:** * [x] Solve all application ~"sharding-blocker" issues caused by running with two databases. Compared to the previous step this will mean we need to solve: * [x] cross-DB queries [`CrossJoinAcrossUnsupportedTablesError`](https://gitlab.com/gitlab-org/gitlab/-/blob/a1b4d1feb3901ca1f2ad250f7442a75133b16233/spec/support/database/prevent_cross_joins.rb#L22) need to be removed * [x] Local GDK development should be configured with 2 write connections by default like phase 4 https://gitlab.com/gitlab-org/gitlab/-/issues/349589 * `.gitlab-ci.yml` configured to use 2 different DB connections ### :red_circle: Phase 7: Promotion of CI database | ENV | Status | Date | | ------ | ------ | -----| | STG | **NOT STARTED** | | | PRD | **NOT STARTED** | | **Summary:** At this point the CI standby cluster and Main cluster should contain identical CI data. We will update PGBouncer config to point CI writes to the new CI database. After this point any `ci_*` tables in `main` DB will be stale and similarly non `ci_*` tables in `ci` will be stale data (for removal at some time TBD). **Infrastructure requirements:** * Test and roll back full migration procedures on staging multiple time * Schedule the full staging migration with no roll-back **Application requirements:** * Database migration tooling works for appropriately. **Schematic overview:** ![phase_6__corrected_](/uploads/b961aa27ec1ce0f96898a90beb0a0b78/phase_6__corrected_.png) ## Rollout plan Our intention is to take all phases through all environments ideally 1 phase at a time. This may end up being impractical so sometimes a specific environment may be more than 1 phase ahead at a point in time. The phases are designed to minimize the amount of time we need our environments to differ and mitigate risk by allowing each phase to only make minimal changes with the ability to rollback always except for the final migration.
epic