Database Decomposition Strategies: Near Zero-Downtime Migration
Overview
As part of our ongoing infrastructure optimization efforts, the team has been exploring strategies for decomposing functional groups of database tables from our Main cluster with near-zero downtime. This document outlines approaches for migrating specific functional tables to dedicated Patroni clusters while ensuring continuous service availability.
To illustrate these approaches, we'll use the Security database decomposition as a concrete example. However, the strategies and considerations outlined here can be applied to decompose any set of functionally related tables from the Main cluster.
Initially, I proposed an approach to the team that leveraged logical replication with Consul service management for handling the migration through service endpoint manipulation. After further analysis, I developed an alternative strategy using logical replication with PgBouncer reconfiguration, which offers more controlled cutover capabilities. Both approaches are detailed below, with their respective trade-offs and considerations.
Current Infrastructure
Patroni Clusters
- Main Cluster: Currently hosts all database tables
- Security Cluster: Target cluster for Security-specific tables
PgBouncer Configuration
- RW Traffic:
  - Main-RW-PgBouncers: Standalone PgBouncers and sidekiq PgBouncers for the Main cluster
  - Sec-RW-PgBouncers: Standalone PgBouncers and sidekiq PgBouncers for the Security cluster
- Read Traffic:
  - Local PgBouncers on Main Patroni cluster nodes
  - Local PgBouncers on Security Patroni cluster nodes
Proposed Solutions
Solution 1: Custom Consul Service Configuration
This approach involves manipulating Consul services to manage the transition between clusters.
Initial Setup
- Sec. Application connects to:
  - `sec-db-replica.service.consul`/`replica.patroni-sec.service.consul` for Read traffic
  - `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers for RW traffic
- Configure `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers to connect to `master.patroni-sec`
- Security cluster starts as a physical standby
- Initially stop the Consul service on the patroni-sec cluster
Challenges
- Manual Consul agent configuration required on the Main cluster
- Cannot use Chef, as the patroni cookbook lacks this functionality
- `master.patroni-sec` and `sec-db-replica.service.consul`/`replica.patroni-sec.service.consul` must point to the Main cluster
- Risk of Chef conflicts during extended manual configuration
Migration Steps
- Preparation
  - Convert Physical to Logical replication (a hedged SQL sketch follows this list)
  - Pause DDLs on the Main cluster for Security tables
- Read Traffic Switchover
  - Reconfigure Consul, one node at a time, on Main to remove the Main cluster nodes from the `sec-db-replica.patroni` Consul service
  - Start the Consul service, one node at a time, on the Sec cluster and confirm `sec-db-replica.patroni` points to replicas of the patroni-sec cluster
  - At this point read traffic should go to the patroni-sec cluster
  - Perform Validations/QA testing
  - Rollback path: revert to `db-replica.service.consul`/`replica.patroni.service.consul` if issues arise. Rollback steps:
    - Reconfigure Consul on Main to add the Main cluster nodes back to the `sec-db-replica.patroni` Consul service
    - Stop the Consul service on the Sec cluster and confirm `sec-db-replica.patroni` points to replicas of the Main cluster
- RW Traffic Switchover
  - Pause `pgbouncer-sec` & `pgbouncer-sidekiq-sec`
  - Update Postgres Sequences of Security tables on the patroni-sec cluster
  - Verify zero logical replication lag
  - Remove the Consul agent config on Main for the `master.patroni-sec` Consul service
  - Start Consul on the patroni-sec cluster
  - Make sure `master.patroni-sec` and `sec-db-replica.patroni-sec` point to the patroni-sec cluster. Traffic must not flow to the Main cluster for Security tables!
  - Drop the LR subscription on the patroni-sec cluster
  - Establish reverse logical replication: create a publication on the patroni-sec cluster and a subscription on the Main cluster
  - Resume PgBouncers
  - Rollback path:
    - Use PgBouncer's Pause and Resume
    - Reverse the RW Traffic Switchover steps to reconfigure Consul services
    - Stop Consul on the patroni-sec cluster
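Both solutions start by converting the Security cluster from a physical standby into a logical replica of the Main cluster. A minimal sketch of that conversion is below; the table names (`security_findings`, `security_scans`), database name, and replication user are illustrative assumptions, and the detail of creating the subscription against a slot whose consistent point matches the standby's promotion LSN is intentionally omitted here.

```sql
-- On the Main cluster primary: publish only the Security tables
-- (illustrative table list; the real list comes from the decomposition scope).
CREATE PUBLICATION sec_tables_pub FOR TABLE security_findings, security_scans;

-- On the patroni-sec leader, after promotion from physical standby:
-- the rows are already present, so skip the initial table copy.
CREATE SUBSCRIPTION sec_tables_sub
  CONNECTION 'host=master.patroni.service.consul dbname=gitlabhq_production user=logical_repl'
  PUBLICATION sec_tables_pub
  WITH (copy_data = false);
```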
Risks
- Complex Consul configuration: reconfiguration has to be done manually or automated with Ansible, as the Chef patroni cookbook currently lacks this functionality
- Might require disabling Chef for a longer duration
- Manual / semi-automated operations: we need to ensure we avoid split-brain situations
Solution 2: PgBouncer reconfiguration
This solution leverages PostgreSQL's logical replication capabilities with PgBouncer reconfiguration for a more controlled migration.
Initial Setup
- Sec. Application connects to:
  - `db-replica.service.consul`/`replica.patroni.service.consul` for Read traffic
  - `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers for RW traffic
- Configure `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers to connect to `master.patroni` (a sketch for verifying this from the PgBouncer admin console follows this list)
- No special Consul configuration needed
- Security cluster starts as a physical standby
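Where the `pgbouncer-sec` & `pgbouncer-sidekiq-sec` pools actually point can be checked directly from the PgBouncer admin console, both during this initial setup and after the later cutover. This is a hedged sketch: the pool name `gitlabhq_production`, the admin port, and the admin user are assumptions about the deployment, not confirmed values.

```sql
-- On a pgbouncer-sec node, connected to the PgBouncer admin console
-- (e.g. psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer):
SHOW DATABASES;  -- the host column for gitlabhq_production should list master.patroni
SHOW POOLS;      -- per-pool client/server counts, useful while watching traffic move
```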
Migration Process
Phase 4: Validate Read Traffic Switchover
The following steps will be executed during the Sec DB Decomposition Phase 4 Change Request window:
- Preparation
  - Prepare an MR to redirect Read application traffic from the Main Cluster to the Sec Cluster
- Read Traffic Switchover to Sec Cluster
  - Confirm the physical replication from the Main cluster to the Sec cluster is working properly with minimal replication lag (a hedged monitoring sketch follows this list)
  - Merge the application MR to switch the Sec application to connect to `sec-db-replica.service.consul` instead of `db-replica.service.consul`
  - Monitor tuple statistics of Sec application tables to confirm the Sec application is successfully using the Sec Cluster replicas
  - Observe traffic patterns on the Sec cluster's local PgBouncers running on Sec cluster replicas
  - Perform comprehensive validation and QA testing for the defined monitoring period
- Controlled Rollback: Read Traffic Switchover back to Main Cluster
  - After successful validation and completion of the monitoring period, revert the application MR to switch the Sec application back to `db-replica.service.consul` instead of `sec-db-replica.service.consul`
  - Confirm Sec application functionality remains intact
  - Monitor tuple statistics to verify the Sec application is properly routing queries to the Main Cluster replicas
  - Perform final validation tests to confirm successful rollback
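Both the replication-lag check and the tuple-statistics check referenced above can be done with standard PostgreSQL statistics views. A minimal sketch, assuming illustrative Security table names:

```sql
-- On the Main cluster primary: physical replication lag per standby, in bytes.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a Sec cluster replica: read activity against the decomposed tables,
-- which should start climbing once the application MR is merged
-- (table names are illustrative).
SELECT relname, seq_scan, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE relname IN ('security_findings', 'security_scans')
ORDER BY relname;
```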
Phase 7: Production Sec Decomposition and Cutover
The following steps will be executed during the Sec DB Decomposition Phase 7 Change Request window:
- Preparation
  - Convert Physical to Logical replication
  - Pause DDLs on the Main cluster for Security tables
- Read Traffic Switchover
  - Switch Sec. application to connect to `sec-db-replica.service.consul` instead of `db-replica.service.consul`. The application should be able to handle this change gracefully.
  - Perform Validations/QA testing
  - Rollback path: revert Sec. application to connect to `db-replica.service.consul` if issues arise
- RW Traffic Switchover
  - Disable chef-client on `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers
  - Pause `pgbouncer-sec` & `pgbouncer-sidekiq-sec`
  - Update Postgres Sequences of Security tables on the patroni-sec cluster (a hedged sketch of the sequence update and lag check follows this list)
  - Verify zero logical replication lag
  - Drop the LR subscription on the patroni-sec cluster
  - Create write-lock triggers on the Security tables in the Main cluster (need to investigate how to implement this: as a rake task, an SQL statement block, or via an Ansible playbook; a possible SQL shape is sketched after this list)
  - Establish reverse logical replication: create a publication on the patroni-sec cluster and a subscription on the Main cluster (see the sketch after this list)
  - Update PgBouncer configurations via the switchover playbook: update `pgbouncer-sec` & `pgbouncer-sidekiq-sec` to connect to `master.patroni-sec` instead of `master.patroni` and reload PgBouncers (the reload should only re-read database.ini, but if it also resumes the paused transactions we will have a brief downtime; the admin-console commands are sketched after this list)
  - Validate database pool connectivity: it should connect to `master.patroni-sec`
  - Resume PgBouncers
  - Post-RW Traffic Migration tasks:
    - Merge the Chef MR to persist configuration changes: update `pgbouncer-sec` & `pgbouncer-sidekiq-sec` to connect to `master.patroni-sec`
    - Run chef-client sequentially on the PgBouncers and verify RW traffic continues to `master.patroni-sec`
    - Enable Chef on all PgBouncers
- Rollback path: If issues arise, the GitLab team prefers to roll back both RW and Read traffic to the Main cluster. The high-level steps would be:
  - Disable chef-client on `pgbouncer-sec` & `pgbouncer-sidekiq-sec` PgBouncers
  - Drop the triggers on the Sec tables on the Main cluster
  - Pause `pgbouncer-sec` & `pgbouncer-sidekiq-sec`
  - Update Postgres Sequences of Security tables on the Main cluster
  - Wait until the logical replication lag is zero bytes
  - Drop the logical replication subscription on the Main cluster
  - Create write-lock triggers on the Security tables in the Sec cluster
  - Update PgBouncer configurations via the switchover rollback playbook: update `pgbouncer-sec` & `pgbouncer-sidekiq-sec` to connect to `master.patroni` instead of `master.patroni-sec` and reload PgBouncers (same reload caveat as above)
  - Read Traffic rollback to the Main cluster: switch Sec. application to connect to `db-replica.service.consul` instead of `sec-db-replica.service.consul`. The application should be able to handle this change gracefully.
  - Validate database pool connectivity: it should connect to `master.patroni`
  - Resume PgBouncers
  - Post-Read and RW Traffic rollback tasks:
    - Merge the Chef MR to persist configuration changes (`pgbouncer-sec` & `pgbouncer-sidekiq-sec` to connect to `master.patroni`)
    - Run chef-client sequentially on the PgBouncers and verify RW traffic continues to `master.patroni`
    - Enable Chef on all PgBouncers
    - Drop the publication on the Sec cluster
    - Merge the Terraform MR to destroy the Sec cluster
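The cutover and rollback steps above lean on a few SQL-level operations that are worth sketching. First, sequence state is not carried by logical replication, so the Security-table sequences on the target cluster must be advanced past the values last used on the source before writes land there; second, the zero-lag check can be read from the publisher's replication slots. Sequence and table names, and the safety margin, are illustrative assumptions rather than final runbook values.

```sql
-- On the patroni-sec leader (or on Main during rollback): advance each
-- Security-table sequence past the highest value used on the other side.
SELECT setval('security_findings_id_seq',
              (SELECT COALESCE(MAX(id), 0) + 10000 FROM security_findings));

-- On the publishing side: bytes the logical subscriber has not yet confirmed.
-- This should be at (or very near) zero before the cutover proceeds.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```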
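The write-lock triggers are still an open item (rake task vs. SQL block vs. Ansible playbook). One possible shape, sketched as plain SQL with illustrative names, is a statement-level trigger that rejects any late application writes; since the logical replication apply worker runs with `session_replication_role = 'replica'`, an ordinary trigger like this should not interfere with the reverse replication stream.

```sql
-- On the Main cluster primary: reject application writes to the decomposed tables.
CREATE OR REPLACE FUNCTION sec_tables_write_lock() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'writes to % are locked: this table now lives on patroni-sec', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

-- Repeat per Security table (trigger and table names are illustrative).
CREATE TRIGGER security_findings_write_lock
  BEFORE INSERT OR UPDATE OR DELETE OR TRUNCATE ON security_findings
  FOR EACH STATEMENT EXECUTE FUNCTION sec_tables_write_lock();
```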
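Reverse logical replication mirrors the original direction: a publication on the new primary and a subscription on the old one, created without an initial copy because the rows already exist on both sides. Connection details and object names below are assumptions.

```sql
-- On the patroni-sec leader: publish the Security tables back towards Main.
CREATE PUBLICATION sec_tables_reverse_pub FOR TABLE security_findings, security_scans;

-- On the Main cluster primary: subscribe without copying the existing data.
CREATE SUBSCRIPTION sec_tables_reverse_sub
  CONNECTION 'host=master.patroni-sec.service.consul dbname=gitlabhq_production user=logical_repl'
  PUBLICATION sec_tables_reverse_pub
  WITH (copy_data = false);
```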
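Finally, the pause/reload/resume around the PgBouncer reconfiguration is driven from each PgBouncer's admin console rather than from Postgres. The pool name `gitlabhq_production` is an assumption about the deployment.

```sql
-- On each pgbouncer-sec / pgbouncer-sidekiq-sec admin console:
PAUSE gitlabhq_production;   -- drain in-flight transactions, queue new clients
RELOAD;                      -- pick up the databases entry pointing at master.patroni-sec
RESUME gitlabhq_production;  -- release queued clients against the new primary
```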
Advantages
- More controlled migration process
- No complex Consul manipulation required
- Avoids potential split-brain situations
Recommendation
Solution 2 (the PgBouncer reconfiguration approach) appears to be the more robust and manageable option, offering:
- Clearer rollback paths
- Less manual intervention
- Reduced risk of configuration errors
- Better maintainability during and after migration
Next Steps
- Review and refine the steps
- Design safeguards and incorporate steps (e.g. write-lock triggers on the Security tables in the Main cluster) to prevent application write-backs to the Main cluster post-RW Traffic switchover
- Create specific playbook tasks for each migration step
- Develop comprehensive testing procedures
- Define success criteria and rollback triggers