Database Decomposition Strategies: Near Zero-Downtime Migration

Overview

As part of our ongoing infrastructure optimization efforts, the team has been exploring strategies for decomposing functional groups of database tables from our Main cluster with near-zero downtime. This document outlines approaches for migrating specific functional tables to dedicated Patroni clusters while ensuring continuous service availability.

To illustrate these approaches, we'll use the Security database decomposition as a concrete example. However, the strategies and considerations outlined here can be applied to decompose any set of functionally related tables from the Main cluster.

Initially, I proposed an approach to the team that leveraged logical replication with Consul service management, handling the migration through service endpoint manipulation. After further analysis, I developed an alternative strategy that uses logical replication with PgBouncer reconfiguration, which offers a more controlled cutover. Both approaches are detailed below, with their respective trade-offs and considerations.

Current Infrastructure

Patroni Clusters

  • Main Cluster: Currently hosts all database tables
  • Security Cluster: Target cluster for Security-specific tables

PgBouncer Configuration

  • RW Traffic:
    • Main-RW-PgBouncers: Standalone PgBouncers and sidekiq PgBouncers for Main cluster
    • Sec-RW-PgBouncers: Standalone PgBouncers and sidekiq PgBouncers for Security cluster
  • Read Traffic:
    • Local PgBouncers on Main Patroni cluster nodes
    • Local PgBouncers on Security Patroni cluster nodes

Proposed Solutions

Solution 1: Custom Consul Service Configuration

This approach involves manipulating Consul services to manage the transition between clusters.

Initial Setup

  • Sec. Application connects to:
    • sec-db-replica.service.consul / replica.patroni-sec.service.consul for Read traffic
    • pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers for RW traffic
  • Configure pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers to connect to master.patroni-sec
  • Security cluster starts as a physical standby
  • Initially stop Consul service on patroni-sec cluster

Challenges

  1. Manual Consul agent configuration required on the Main cluster
    • Cannot use Chef, as the patroni cookbook lacks this functionality
    • master.patroni-sec and sec-db-replica.service.consul / replica.patroni-sec.service.consul must point to main cluster
    • Risk of Chef conflicts during extended manual configuration

Migration Steps

  1. Preparation
    • Convert Physical to Logical replication (a sketch of this conversion follows these migration steps)
    • Pause DDLs on the Main cluster for Security tables
  2. Read Traffic Switchover
    • Reconfigure Consul, one node at a time, on Main to remove the Main cluster nodes from the sec-db-replica.patroni Consul service
    • Start Consul service, one node at a time, on the Sec cluster and confirm sec-db-replica.patroni points to replicas of the patroni-sec cluster
    • At this point, read traffic should go to the patroni-sec cluster.
    • Perform Validations/QA testing
    • Rollback path: if issues arise, revert to db-replica.service.consul / replica.patroni.service.consul. Rollback steps - reconfigure Consul on Main to re-add the Main cluster nodes to the sec-db-replica.patroni Consul service, then stop the Consul service on the Sec cluster and confirm sec-db-replica.patroni points to replicas of the Main cluster.
  3. RW Traffic Switchover
    • Pause pgbouncer-sec & pgbouncer-sidekiq-sec
    • Update Postgres Sequences of Security tables on patroni-sec cluster
    • Verify zero logical replication lag
    • Remove Consul agent config on Main for master.patroni-sec Consul service
    • Start Consul on patroni-sec cluster
    • Make sure master.patroni-sec and sec-db-replica.patroni-sec point to the patroni-sec cluster. Traffic must not flow to the Main cluster for Security tables!
    • Drop LR subscription on patroni-sec cluster
    • Establish reverse logical replication - Create publication on patroni-sec cluster and subscription on the Main cluster
    • Resume PgBouncers
    • Rollback path:
      • Use pgbouncer's Pause and Resume
      • Reverse the RW Traffic Switchover steps to reconfigure Consul services
      • Stop Consul on patroni-sec cluster
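
Both solutions depend on the "Convert Physical to Logical replication" preparation step. Below is a minimal sketch of that conversion, assuming the Security tables are already enumerated and that the patroni-sec standby is promoted at a known LSN; the publication/subscription names, table list, and connection string are placeholders rather than the final values.

      -- On the Main cluster primary: publish only the Security tables.
      CREATE PUBLICATION sec_pub FOR TABLE security_findings, security_scans;  -- placeholder table list

      -- Promote the patroni-sec physical standby, recording the promotion LSN.

      -- On the promoted patroni-sec primary: subscribe without an initial copy,
      -- since the tables already contain the data from physical replication.
      CREATE SUBSCRIPTION sec_sub
        CONNECTION 'host=master.patroni.service.consul dbname=gitlabhq_production user=postgres'  -- placeholder connection
        PUBLICATION sec_pub
        WITH (copy_data = false);

      -- Note: to avoid losing or re-applying changes, the subscription's replication
      -- origin has to line up with the LSN at which the standby was promoted (e.g. by
      -- creating the replication slot before promotion and advancing the origin with
      -- pg_replication_origin_advance()); this needs to be rehearsed on a test cluster.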

Risks

  • Complex Consul configuration - manual or Ansible-automated reconfiguration steps are needed, as the Chef cookbook lacks this functionality at the moment.
  • Might require disabling Chef for an extended period.
  • Manual / semi-automated operations - need to ensure we avoid split-brain situations.

Solution 2: PgBouncer Reconfiguration

This solution leverages PostgreSQL's logical replication capabilities with PgBouncer reconfiguration for a more controlled migration.

Initial Setup

  • Sec. Application connects to:
    • db-replica.service.consul / replica.patroni.service.consul for Read traffic
    • pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers for RW traffic
  • Configure pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers to connect to master.patroni
  • No special Consul configuration needed
  • Security cluster starts as a physical standby

Migration Process

Phase 4: Validate Read Traffic Switchover

The following steps will be executed during the Sec DB Decomposition Phase 4 Change Request window:

  1. Preparation

    • Prepare MR to redirect Read application traffic from Main Cluster to Sec Cluster
  2. Read Traffic Switchover to Sec Cluster

    • Confirm the physical replication from the Main cluster to the Sec cluster is working properly with minimal replication lag (a lag query is sketched after these steps)
    • Merge application MR to switch Sec application to connect to sec-db-replica.service.consul instead of db-replica.service.consul
    • Monitor tuple statistics of Sec application tables to confirm the Sec application is successfully using the Sec Cluster replicas (see the tuple-statistics query after these steps)
    • Observe traffic patterns on Sec cluster's local PgBouncers running on Sec cluster replicas
    • Perform comprehensive validation and QA testing for the defined monitoring period
  3. Controlled Rollback: Read Traffic Switchover back to Main Cluster

    • After successful validation and completion of the monitoring period, revert the application MR to switch Sec application back to db-replica.service.consul instead of sec-db-replica.service.consul
    • Confirm Sec application functionality remains intact
    • Monitor tuple statistics to verify Sec application is properly routing queries to the Main Cluster replicas
    • Perform final validation tests to confirm successful rollback
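
For the replication health check in step 2, lag can be confirmed on the Main cluster primary with a query along these lines (illustrative only):

      -- Per-standby physical replication lag, in bytes and as replay delay.
      SELECT application_name,
             client_addr,
             pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
             replay_lag
      FROM pg_stat_replication;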
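
The tuple-statistics checks in steps 2 and 3 can compare read activity on the Security tables across both clusters over the monitoring period; a sketch with placeholder table names:

      -- Run on replicas of both clusters and compare deltas over time.
      SELECT relname,
             seq_scan, seq_tup_read,
             idx_scan, idx_tup_fetch
      FROM pg_stat_user_tables
      WHERE relname IN ('security_findings', 'security_scans')  -- placeholder Security tables
      ORDER BY relname;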

Phase 7: Production Sec Decomposition and Cutover

The following steps will be executed during the Sec DB Decomposition Phase 7 Change Request window:

  1. Preparation

    • Convert Physical to Logical replication (see the conversion sketch under Solution 1's migration steps)
    • Pause DDLs on the Main cluster for Security tables
  2. Read Traffic Switchover

    • Switch Sec. application to connect to sec-db-replica.service.consul instead of db-replica.service.consul. The application should be able to handle this change gracefully.
    • Perform Validations/QA testing
    • Rollback path: revert Sec. application to connect to db-replica.service.consul if issues arise
  3. RW Traffic Switchover

    • Disable chef-client on pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers
    • Pause pgbouncer-sec & pgbouncer-sidekiq-sec
    • Update Postgres Sequences of Security tables on the patroni-sec cluster (see the sequence sketch after this list)
    • Verify zero logical replication lag (a lag query is sketched after this list)
    • Drop LR subscription on patroni-sec cluster
    • Create write-lock triggers on the Security tables in the Main cluster
      (need to investigate how to implement this - either as a rake task, an SQL statement block, or via an Ansible playbook; a trigger sketch is included after this list).
    • Establish reverse logical replication - Create publication on patroni-sec cluster and subscription on Main cluster
    • Update PgBouncer configurations via the switchover playbook: update pgbouncer-sec & pgbouncer-sidekiq-sec to connect to master.patroni-sec instead of master.patroni and reload the PgBouncers (the reload should re-read database.ini, but if reloading also resumes the paused transactions, we will have a brief downtime). An admin-console sketch is included after this list.
    • Validate database pool connectivity - it should connect to master.patroni-sec
    • Resume PgBouncers
    • Post-RW Traffic Migration tasks:
      • Merge Chef MR to persist configuration changes: Update pgbouncer-sec & pgbouncer-sidekiq-sec to connect to master.patroni-sec
      • Run chef-client sequentially on PgBouncers: Verify RW traffic continues to master.patroni-sec
      • Enable Chef on all PgBouncers
    • Rollback path: if issues arise, the GitLab team prefers to roll back both RW and Read traffic to the Main cluster. The high-level steps would be:
      • Disable chef-client on pgbouncer-sec & pgbouncer-sidekiq-sec PgBouncers
      • Drop triggers on Sec tables on the Main cluster
      • Pause pgbouncer-sec & pgbouncer-sidekiq-sec
      • Update Postgres Sequences of Security tables on Main cluster
      • Wait until the logical replication lag is zero bytes
      • Drop logical replication subscription on the Main cluster
      • Create write-lock triggers on the Security tables in the Sec cluster
      • Update PgBouncer configurations via the switchover rollback playbook: update pgbouncer-sec & pgbouncer-sidekiq-sec to connect to master.patroni instead of master.patroni-sec and reload the PgBouncers (the reload should re-read database.ini, but if reloading also resumes the paused transactions, we will have a brief downtime)
      • Read Traffic rollback to the Main cluster: switch the Sec. application to connect to db-replica.service.consul instead of sec-db-replica.service.consul. The application should be able to handle this change gracefully.
      • Validate database pool connectivity - it should connect to master.patroni
      • Resume PgBouncers
      • Post-Read and RW Traffic rollback tasks:
        • Merge Chef MR to persist configuration changes (pgbouncer-sec & pgbouncer-sidekiq-sec to connect to master.patroni)
        • Run chef-client sequentially on PgBouncers: Verify RW traffic continues to master.patroni
        • Enable Chef on all PgBouncers
        • Drop publication on the Sec cluster
        • Merge Terraform MR to destroy Sec cluster
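
For the "Update Postgres Sequences" step above, a minimal sketch (sequence/table names and the safety offset are placeholders; the real step would iterate over all Security-table sequences):

      -- On patroni-sec, bump each Security-table sequence past the last value
      -- used on Main, with a safety margin to absorb any in-flight writes.
      SELECT setval('security_findings_id_seq',
                    (SELECT COALESCE(MAX(id), 0) + 100000 FROM security_findings));

      -- Review the current positions of the relevant sequences:
      SELECT schemaname, sequencename, last_value
      FROM pg_sequences
      WHERE sequencename LIKE 'security_%';  -- placeholder name pattern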
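
Zero logical replication lag can be confirmed on the Main cluster primary before the cutover continues; the slot name below is a placeholder for the slot backing the Sec subscription:

      -- Bytes of WAL not yet confirmed by the subscriber.
      SELECT slot_name,
             pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
      FROM pg_replication_slots
      WHERE slot_type = 'logical'
        AND slot_name = 'sec_sub';  -- placeholder slot name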
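
For the write-lock safeguard, one of the options mentioned is an SQL statement block; a minimal sketch is below (function, trigger, and table names are placeholders, and one trigger would be created per Security table). Ordinary triggers do not fire for changes applied by a logical replication subscription worker, so this should not block the reverse replication, but that behaviour must be verified during testing.

      -- Reject direct application writes to Security tables on the Main cluster.
      CREATE OR REPLACE FUNCTION sec_tables_write_lock() RETURNS trigger AS $$
      BEGIN
        RAISE EXCEPTION 'writes to % are locked: this table has moved to the Security cluster', TG_TABLE_NAME;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER sec_write_lock
        BEFORE INSERT OR UPDATE OR DELETE ON security_findings  -- repeat per Security table
        FOR EACH STATEMENT
        EXECUTE FUNCTION sec_tables_write_lock();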
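
The pause/reconfigure/resume sequence is driven through the PgBouncer admin console (e.g. psql connected to the pgbouncer virtual database); a sketch of the commands the switchover playbook would issue, annotated here with SQL-style comments. The database alias is a placeholder, and the host change itself is made in the PgBouncer databases configuration before RELOAD is issued.

      PAUSE gitlabhq_production;    -- wait for in-flight transactions, then hold new ones
      RELOAD;                       -- re-read the configuration pointing at master.patroni-sec
      SHOW DATABASES;               -- confirm the host column now shows master.patroni-sec
      RESUME gitlabhq_production;   -- release the paused traffic to the new primary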

Advantages

  • More controlled migration process
  • No complex Consul manipulation required
  • Lower risk of a split-brain situation

Recommendation

Solution 2 (the PgBouncer reconfiguration approach) appears to be the more robust and manageable option, offering:

  • Clearer rollback paths
  • Less manual intervention
  • Reduced risk of configuration errors
  • Better maintainability during and after migration

Next Steps

  1. Review and refine the steps
  2. Design safeguards and incorporate steps (e.g. write-lock triggers on the Security tables in the Main cluster) to prevent application write-backs to the Main cluster post-RW Traffic switchover
  3. Create specific playbook tasks for each migration step
  4. Develop comprehensive testing procedures
  5. Define success criteria and rollback triggers