# Possible blocker for general PostgreSQL upgrade strategy

## PostgreSQL upgrade with (near) zero downtime
In order to fulfill the requirement of executing a database upgrade without (or with very minimal) user-facing downtime, the following procedure was chosen.
- A second cluster, `TARGET`, is created as a carbon copy of `SOURCE`
- `TARGET` is synchronized to `SOURCE` via physical replication
- `TARGET` is converted to sync via logical replication
- `TARGET` is upgraded to a newer major version
  - the upgrade takes some time
  - the clusters are no longer in sync
- `TARGET` is synced and kept in sync via logical replication (asynchronously)
  - logical replication takes time and is resource intensive
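
The switch from physical to logical replication could look roughly like the following sketch. The publication/subscription names and the connection string are placeholders, not the actual values used in production:

```sql
-- On the SOURCE primary: publish all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the TARGET primary: subscribe to SOURCE.
-- copy_data = false because TARGET already holds the data from the
-- preceding physical replication step.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=source.example.com dbname=gitlabhq_production'
    PUBLICATION upgrade_pub
    WITH (copy_data = false);
```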
At this point we have two clusters: `SOURCE` running the old major version, `TARGET` the new.
Both clusters are synced asynchronously, so small delays are to be accepted.
Logical replication is less performant, so the delays are higher than between the physically replicated nodes within one cluster.
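
The replication delay can be monitored with the standard statistics views, for example:

```sql
-- On the SOURCE primary: bytes of WAL the logical subscriber has not
-- yet replayed (application_name matches the subscription name).
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On the TARGET primary: state of the subscription worker.
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;
```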
Now we need to decide if and when the application traffic should be routed to `TARGET`.
We could just make a hard cut and redirect all traffic to the new cluster at one point in time. If the cluster is not performing, this becomes a user-facing problem, potentially hard to solve.
Therefore the `TARGET` cluster is tested before the switchover is performed.
- Standby nodes of the `TARGET` cluster are included in the load balancing
- Standby nodes of the `SOURCE` cluster are removed from load balancing
  - All write load is handled by the `SOURCE` primary
  - All read-only transactions are handled by `TARGET` standbys
- Tests are made, metrics are monitored during the process
- When certain everything works as expected, the write load will also be routed to `TARGET`
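
A minimal smoke test for this phase, run through the read-only endpoint (assuming the load balancer routes the connection to a `TARGET` standby): the answering node should report the new major version and recovery mode.

```sql
-- A query that landed on a TARGET standby reports the new server
-- version and is_standby = true.
SELECT current_setting('server_version') AS version,
       pg_is_in_recovery()               AS is_standby;
```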
## Upgrade Flowchart
```mermaid
flowchart TB
subgraph Prepare new environment
A[Create new cluster $TARGET as a carbon copy of the one to upgrade, $SOURCE.] --> B
B[Attach $TARGET as a standby-only-cluster to $SOURCE via physical replication.] --> C
end
C[Make sure both clusters are in sync.] --> D1
subgraph UP["Upgrade: ansible-playbook upgrade.yml"]
D1[Disable Chef] --> D
D[Change from physical replication to logical.] --> E
E[Make sure both clusters are in sync again.] --> G
end
G[Upgrade $TARGET to new version via pg_upgrade.] --> H
subgraph Prepare switchover
H[Make sure both clusters are in sync again.] --> I
I[Merge Chef MRs so $TARGET uses roles for new PostgreSQL version] --> K
K[Enable Chef, run Chef-Client] --> L
L[Make sure Chef finished successfully and cluster is still operational] --> M
M[Disable Chef again] --> N
end
N[Check metrics and sanity checks are as expected] --> O
subgraph SW["Switchover: ansible-playbook switchover.yml"]
O[Redirect RO traffic to $TARGET standbys in addition to $SOURCE] --> P
P[Check if cluster is operational and metrics are normal] --"Normal"--> Q
P --"Abnormal"--> GR
Q[Redirect RO only to $TARGET] --> R
R[Check if cluster is operational and metrics are normal] --"Normal"--> S
R --"Abnormal"--> GR
S[Quality team verify their tests run as expected] --"Normal"--> T
S --"Abnormal"--> GR
end
T[Switchover: Redirect RW traffic to $TARGET] --> U1
subgraph Post Switchover Verification
U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
U1 --"Abnormal"--> LR
U2[Enable Chef, run Chef-Client] --"Normal"--> U3
U2 --"Abnormal"--> LR
U3["Check if cluster is operational and metrics are normal,<br/>validate for $point_of_no_return minutes"] --"Normal"--> Success
U3 --"Abnormal"--> LR
Success[Success!]
end
subgraph GR[Graceful Rollback - no data loss]
GR1[Start graceful rollback]
end
subgraph LR[Late Rollback - DATA LOSS up to $point_of_no_return minutes]
LR1[Start late rollback]
end
```
## Multi-Version Problem
During the process we have a window where:

- All write and some read transactions are handled by `SOURCE`, running version `x` (PG12)
- Some read transactions are handled by `TARGET`, running version `y` (PG14)
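
If the application needs to detect which major version a given connection is talking to, one option is to inspect `server_version_num` per connection, for example:

```sql
-- server_version_num is e.g. 120011 for PG12 or 140005 for PG14;
-- integer division by 10000 yields the major version.
SELECT current_setting('server_version_num')::int / 10000 AS major_version;
```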
As highlighted by @DylanGriffith here db-migration!412 (comment 1375987600), it is possible that our application is not able to handle a situation where not all transactions are processed by the same PostgreSQL major version.
## Redirection Problem
As pointed out here gitlab-org/gitlab#364370 (closed), moving read-only queries between standbys may take significant time and behaves erratically.
## Solution
This issue is intended to:
- Highlight the problem
- Determine if this is indeed a blocker for the upcoming upgrades
- Find a viable solution if needed