Decomposition: Rollout decomposed Sec database on GitLab.com (staging and production)
# Overview
The [Sec DB Decomposition Working Group](https://handbook.gitlab.com/handbook/company/working-groups/secure-govern-database-decomposition/) aims to move Sec tables to a separate database and a new schema, `gitlab_sec`, has been created as a target for the tables in scope.
This epic describes the plan to rollout the decomposed `sec` database to the staging and production environments.
See also:
- Parent epic: https://gitlab.com/groups/gitlab-org/-/work_items/14165+
- This plan was based on the CI plan: https://gitlab.com/groups/gitlab-org/-/epics/6160?force_legacy_view=true
# Timeline
In the gantt chart below, the the rollout effort (this epic) correspond to the `Phase n` items.
```mermaid
gantt
dateFormat YYYY-MM-DD
title 80% confidence timeline
section Work
Gitlab Decomposition Ready :active , decompose, 2024-07-01, 2025-02-14
Non-Slice Work :active, nonslicework, 2024-07-15, 2025-02-14
Slice 1 :active, slice1, 2024-07-23, 2025-01-13
Slice 2 :active, slice2, 2024-08-06, 2024-12-30
Slice 3 :active, slice3, 2024-07-15, 2025-02-10
Gitlab Application Ready for Decomposition :milestone, allslices, after slice1 slice2 slice3 nonslicework, 0d
Phase 1 & 2 : phase12, 2024-09-11, 16w
Phase 4 : phase4, 2025-03-14, 3w
Phase 5 : phase5, after phase4, 1w
Phase 6 : phase6, 2025-03-21, 3w
Phase 7 : phase7, after phase6, 1w
Rollout complete :milestone, rollout, 2025-04-19, 1d
axisFormat %Y-%m-%d
```
# Feature flags
| Name | Description | Link | Phase |
|------|-------------|------|-------|
| None | | | |
# Rollout plan
## Phase 1: deploy Patroni cluster
| ENV | Status | Date |
|-----|--------|------|
| STG | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18433 "[GSTG][Security Decomp]Stand Up Security Decomp Cluster + pgbouncer") | 2024-11-01 |
| PRD | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19214) | 2025-02-25 |
**Summary:** Infrastructure will provision a Patroni cluster. This cluster is not yet required to be connected to the main cluster. We only need to establish that we can deploy (and re-deploy) this cluster and that all required monitoring and logging capabilities are in place.
**Infrastructure requirements:**
* [x] Deployment of `gitlab-sec` Patroni cluster to staging
* [x] Deployment of `gitlab-sec` Patroni cluster to production
**Application requirements:**
* None
## Phase 2: configure Patroni standby cluster
| ENV | Status | Date |
|-----|--------|------|
| STG | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18433) | 2024-10-31 |
| PRD | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19214) | 2025-02-25 |
**Summary:** Infrastructure will configure the unused Patroni cluster as a standby cluster (see https://patroni.readthedocs.io/en/latest/replica_bootstrap.html#standby-cluster). The entire Main database will be replicated via cascading physical replication before transitioning to logical in later phase.
**Infrastructure requirements:**
* [x] Configuration of `gitlab-sec` Patroni cluster on staging as standby
* [x] Configuration of `gitlab-sec` Patroni cluster on production as standby
**Application requirements:**
* None
**Data Analytics Requirement:**
1. Post setup of the staging cluster for security instance. Data analytics to modify the pipeline under project [data-server-rebuild-ansible](https://gitlab.com/gitlab-com/gl-infra/data-server-rebuild-ansible) to create a new snapshot environment in the db_benchmarking, similar to current main and ci database. **(To implement this we have dependency with DBRE and SRE team)** https://gitlab.com/gitlab-data/analytics/-/issues/21930
2. Post setup of production cluster for security instance. Data analytics to modify the pipeline under project [data-server-rebuild-ansible](https://gitlab.com/gitlab-com/gl-infra/data-server-rebuild-ansible) to create a new snapshot environment in the db_benchmarking production instance , similar to current main and ci database. Also remove the reference to staging cluster.**(To implement this we have dependency with DBRE and SRE team)**
3. Data Analytics team can reach out to the new cluster created from data-ops GKE cluster to ensure the pipeline can be pointed.
**Schematic overview:**
> 
## Phase 3: SKIPPED
| ENV | Status | Date |
|-----|--------|------|
| STG | _SKIPPED_ | |
| PRD | _SKIPPED_ | |
**Summary:** Previously planned to read traffic _only_ for Sec data from the `gitlab-sec` database, however we have combined the rollout into Phase 4 to switch both reads and writes simultaneously for Sec data from `main` to `sec`.
This phase is now essentially a No-op but remains in place to prevent writing history and keep some uniformity with the execution phases of the previous CI rollout.
**Infrastructure requirements:**
* None
**Application requirements:**
* None
## Phase 4: separate read and write connection for Sec from Main (still going to the same primary host)
| ENV | Status | Date |
|-----|--------|------|
| STG | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19546) | 2025-03-27 |
| PRD | [Completed](https://gitlab.com/gitlab-org/gitlab/-/issues/509191) | 2025-04-02 |
**Summary:** Here we will use different connections for `gitlab_main` and `gitlab_sec`. This means that the application must be able to work with different databases. For the application, the Sec database happens to be located at the Main database.
We plan on temporarily directing writes to the patroni-sec cluster to observe behavior before reverting reads back to the main cluster until physical-to-logical switchover is performed.
**Infrastructure requirements:**
* [x] Point Sec PGBouncer host at Main primary and configure Sec writes to go through this new PGBouncer, reads to go through sec_replicas
* [x] After, observation time has passed, redirect Sec writes back to main's `db-replica.service.consul`
**Application requirements:**
* [x] All Sec tables inherit from `SecApplicationRecord` for configurable `sec` DB connection
* [x] All Sec tables have dropped foreign keys to gitlab_main or gitlab_ci tables, relying on [LFKs](https://docs.gitlab.com/ee/development/database/loose_foreign_keys.html) where necessary
* [x] [Enable CI multiple DB testing by default for `gitlab_main` and `gitlab_sec` databases](https://gitlab.com/gitlab-org/gitlab/-/issues/493983)
* [x] Solve all known problems caused by 2 different primary connections (i.e. cross-transactions and cross-DB modifications. Close out remainder of issues in Slices 1, 2, and 3 and remove remaining allowances of `allow_cross_joins_across_databases` and `PreventCrossDatabaseModification.temporary_ignore_tables_in_transaction` allowances.
* [x] Implement and test rollback strategy on staging https://gitlab.com/gitlab-org/gitlab/-/issues/509034+
**Data Analytics requirements:**
1. Data Analytics current pipeline fetching data from main database for security related tables should be able to read data from security clusters for daily load.
2. Remove reference to the main database for the security related tables.
**Schematic overview:**
> 
## Phase 5: staging dry-run and finishing the migration plan
| ENV | Status | Date |
|-----|--------|------|
| STG | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684) | |
| PRD | N/A | |
**Summary:** This is the most critical part of the process. We are risking data loss and/or a split brain situation and will need to dry run. During this phase we'll test and iterate on the actual failover procedure in local, preprod and staging environments.
**Infrastructure requirements:**
* [x] Ansible scripts for downtime and failover process completed (See https://gitlab.com/gitlab-com/gl-infra/db-migration/-/merge_requests/571)
* [x] Dry-run on staging for the migration plan (ie. blocking writes for a period and testing that we catch up replicating Sec tables) (See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18406#note_2067779759+)
* [x] [Conversion of physical to logical replication](https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical?ref_type=heads) for patroni cluster on staging
* [x] Implement and test a rollback plan on staging https://gitlab.com/gitlab-org/gitlab/-/issues/508382+
* [x] Update `prometheus_alert_db_indicators_settings` configuration for gstg environment (See [CR gstg](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19031))
**Application requirements:**
* [x] Ensure any pending background batch migrations against `sec` fields are finalized prior to switchover, Relates to https://gitlab.com/gitlab-org/gitlab/-/issues/482647+
## Phase 6: validation of metrics and additional logging
| ENV | Status | Date |
|-----|--------|------|
| STG | Completed | |
| PRD | _Ongoing_ | |
**Summary:** At this point we want to be sure that the application works correctly with the additional database. We want to make sure available metrics support all assumptions about how the application and the database interact. Some example validation:
* [x] Analysis of query logs to confirm that the correct connection/PGBouncer instance is being used for the correct queries (ie. main connection never queries sec, sec connection never queries main)
* [Sentry](https://new-sentry.gitlab.net/organizations/gitlab/issues/?query=is%3Aunresolved+QueryAnalyzers+message%3A\*gitlab_sec\*&referrer=issue-list&statsPeriod=24h)
* [Kibana](https://log.gprd.gitlab.net/app/r?l=DISCOVER_APP_LOCATOR&v=8.12.1&lz=N4IgjgrgpgTgniAXKSsGJCANCANgQwDsBzCfYqJEAa2nhAF8cBnAexgBckBtbkAKzaEAdBwCWAW0o4AJlGYBjEAF1lOBa1wQJhZjzUgxhOQA8qATgDsAMwAMAFgBMt2wFpbj%2BwCNXARl9QPgAcAQCsfqGhXviWQZahQTL22IaEHLAAbvi4VPgQHKwp1mK46TB6iNygUhz4SKBGphY2Ds5uHt5%2BAcFhEVExcQlJKRxwAA6UGAoQzAUSKTJizPheuFAySNbZzFA4hFDE%2BOmb27sg2WL4FYQQuLg41FDo4HQIOFlakyDAADogXqxNH9kH9mAALVi3GTAqp%2FAFA%2Bp%2FCQzDgwtR%2FYqlWAw34gCRHBRggD6YzBMCuUGBuMErBEMi8RI0hGKxCJhHwUmEjzgAHd2NCkKCoAo%2FgwGOiQODIbgBZUJcjZmzWKj9GKsLj4bgqUiUWisBiSmUcUiCcTSeSdtqBEJhPTGbSWWyOVAuU8%2BTBZUKFESYFAxrgxAp8KLxfrJRCoXqdYrCMq0WL5UZJNoiVKoUT8RxCcDfGLGDgACSzI6TUCzdhffBjMYAZVqxyYLzQ9X%2BgJyyHD0o2lVAmpbCq4cpwmLKPGqppJZIpLZpdIZTMd7M53Pd3clwsYoc7UP0OAHSsH3GUat7bf7KN3IBHsDHeIn5unHdntvnDrErKXLpX%2FKoO29vv9QM6gTFgIxlS991jQ9jy3CQk2RCRUzAmQM1NJBcxAkAZCOfAADUxCgHkAEk1ysOwnBcdxPB8fxAlcEIoHCXxImiWJ4kSZIcHEKQACUiAoFtrBgVh5gwWMeQieYuMKMTWB5fMQF9IT5DBIi0kybIWw%2BaAkAANhcFwcDGPIdiQDgYGgMUgA)
* [x] Analysis of [`gitlab_database_decomposition_gitlab_schemas_used` in Prometheus](https://dashboards.gitlab.net/explore?schemaVersion=1&panes=%7B%228dg%22:%7B%22datasource%22:%22e58c2f51-20f8-4f4b-ad48-2968782ca7d6%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22sum%28rate%28gitlab_database_decomposition_gitlab_schemas_used%7Bgitlab_schemas%3D~%5C%22.%2Asec.%2B%5C%22%7D%5B5m%5D%29%29%20by%20%28app,gitlab_schemas%29%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22e58c2f51-20f8-4f4b-ad48-2968782ca7d6%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%22now-2d%22,%22to%22:%22now%22%7D%7D%7D&orgId=1)
1. Confirm there are no queries with multiple `gitlab_schemas` inclusions to `gitlab_sec` (across `_main`, `_clusterwide`, or `_ci`)
2. Confirm that all `db_config_name=main` or `db_config_name=main_replica` queries correspond to `gitlab_schemas=main` only
3. Confirm that all `db_config_name=sec` or `db_config_name=sec_replica` queries correspond to `gitlab_schemas=sec` only
* [x] https://gitlab.com/gitlab-org/gitlab/-/issues/474934+
**Infrastructure requirements:**
* None
**Application requirements:**
* [x] Solve newly discovered application-level cross-DB queries and transactions belong to ~"sec-decomposition".
## Phase 7: promotion of Sec database
| ENV | Status | Date |
|-----|--------|------|
| STG | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19684) | 2025-04-18 |
| PRD | [Completed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19581) | 2025-04-19 |
**Summary:** At this point the Sec standby cluster and Main cluster should contain identical Sec data. We will update Sec's PGBouncer config to point writes to the new Sec database and keep reads pointed to the new Sec database.
We will enable reverse logical replication to write `sec_*` tables back to main until we feel confident in the split.
Once correct behavior has been validated, we can disable forward logical replication and truncate the sec tables within main.
**Infrastructure requirements:**
* [x] Test and roll back full migration procedures on staging multiple times (see [rollback strategy](https://gitlab.com/gitlab-org/gitlab/-/issues/508382) and [completion](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19031))
* [x] Schedule the [full staging migration](https://gitlab.com/gitlab-org/gitlab/-/issues/511958) with no roll-back
* [x] [Conversion of physical to logical replication](https://gitlab.com/gitlab-com/gl-infra/db-migration/-/tree/master/pg-physical-to-logical?ref_type=heads) for Patroni Sec cluster on production
* [-] Enable maintenance-mode; block traffic to patroni clusters.
* [x] Enable reverse logical replication to write sec tables back to main in case of need for rollback
* [x] Update `prometheus_alert_db_indicators_settings` configuration for gprd environment (See [CR gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19032))
* [x] Upon successful verification of rollout
* [x] [Lock writes on the tables that don’t belong to the database schemas](https://docs.gitlab.com/ee/development/database/multiple_databases.html#locking-writes-on-the-tables-that-dont-belong-to-the-database-schemas)
* [ ] Truncate `sec` tables within `gitlab_main` - See [CI Change Request for example](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7511))
**Application requirements:**
* [x] Database migration tooling works for appropriately.
**Schematic overview:**
> 
epic