# Possible blocker for general PostgreSQL upgrade strategy

## PostgreSQL upgrade with (near) zero downtime
In order to fulfill the requirement of executing a database upgrade without (or with very minimal) user-facing downtime, the following procedure was chosen.
- A second cluster, `TARGET`, is created as a carbon copy of `SOURCE`
- `TARGET` is synchronized to `SOURCE` via physical replication
- `TARGET` is converted to sync via logical replication
- `TARGET` is upgraded to a newer major version
  - the upgrade takes some time
  - the clusters are no longer in sync
- `TARGET` is synced and kept in sync via logical replication (asynchronously)
  - logical replication takes time and is resource intensive
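
The switch from physical to logical replication could look roughly like the following sketch. The publication/subscription names and the connection string are placeholders, not the actual values used in production:

```sql
-- On the SOURCE primary: publish all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the TARGET primary: subscribe to SOURCE.
-- copy_data = false because TARGET already holds the data from the
-- preceding physical replication step.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=source.example.com dbname=gitlabhq_production'
    PUBLICATION upgrade_pub
    WITH (copy_data = false);
```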
At this point we have two clusters: `SOURCE` running the old major version, `TARGET` the new.
Both clusters are synced asynchronously, so small delays are to be accepted.
Logical replication is less performant, so the delays are higher than between the physically replicated nodes within one cluster.
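
The replication delay can be monitored with the standard statistics views, for example:

```sql
-- On the SOURCE primary: bytes of WAL the logical subscriber has not
-- yet replayed (application_name matches the subscription name).
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On the TARGET primary: state of the subscription worker.
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;
```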
Now we need to decide if and when the application traffic should be routed to `TARGET`.
We could just make a hard cut and redirect all traffic to the new cluster at one point in time. If the cluster is not performing, this becomes a user-facing problem, potentially hard to solve.
Therefore the `TARGET` cluster is tested before the switchover is performed.
- Standby nodes of the `TARGET` cluster are included in the load balancing
- Standby nodes of the `SOURCE` cluster are removed from load balancing
  - All write load is handled by the `SOURCE` primary
  - All read-only transactions are handled by `TARGET` standbys
- Tests are made, metrics are monitored during the process
- When certain everything works as expected, the write load will also be routed to `TARGET`
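
A minimal smoke test for this phase, run through the read-only endpoint (assuming the load balancer routes the connection to a `TARGET` standby): the answering node should report the new major version and recovery mode.

```sql
-- A query that landed on a TARGET standby reports the new server
-- version and is_standby = true.
SELECT current_setting('server_version') AS version,
       pg_is_in_recovery()               AS is_standby;
```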
## Upgrade Flowchart
```mermaid
flowchart TB
subgraph Prepare new environment
A[Create new cluster $TARGET as a carbon copy of the one to upgrade, $SOURCE.] --> B
B[Attach $TARGET as a standby-only-cluster to $SOURCE via physical replication.] --> C
end
C[Make sure both clusters are in sync.] --> D1
subgraph UP["Upgrade: ansible-playbook upgrade.yml"]
D1[Disable Chef] --> D
D[Change from physical replication to logical.] --> E
E[Make sure both clusters are in sync again.] --> G
end
G[Upgrade $TARGET to new version via pg_upgrade.] --> H
subgraph Prepare switchover
H[Make sure both clusters are in sync again.] --> I
I[Merge Chef MRs so $TARGET uses roles for new PostgreSQL version] --> K
K[Enable Chef, run Chef-Client] --> L
L[Make sure Chef finished successfully and cluster is still operational] --> M
M[Disable Chef again] --> N
end
N[Check metrics and sanity checks are as expected] --> O
subgraph SW["Switchover: ansible-playbook switchover.yml"]
O[Redirect RO traffic to $TARGET standbys in addition to $SOURCE] --> P
P[Check if cluster is operational and metrics are normal] --"Normal"--> Q
P --"Abnormal"--> GR
Q[Redirect RO only to $TARGET] --> R
R[Check if cluster is operational and metrics are normal] --"Normal"--> S
R --"Abnormal"--> GR
S[Quality team verify their tests run as expected] --"Normal"--> T
S --"Abnormal"--> GR
end
T[Switchover: Redirect RW traffic to $TARGET] --> U1
subgraph Post Switchover Verification
U1[Check if cluster is operational and metrics are normal] --"Normal"--> U2
U1 --"Abnormal"--> LR
U2[Enable Chef, run Chef-Client] --"Normal"--> U3
U2 --"Abnormal"--> LR
U3["Check if cluster is operational and metrics are normal,<br/>validate for $point_of_no_return minutes"] --"Normal"--> Success
U3 --"Abnormal"--> LR
Success[Success!]
end
subgraph GR[Graceful Rollback - no data loss]
GR1[Start graceful rollback]
end
subgraph LR[Late Rollback - DATA LOSS up to $point_of_no_return minutes]
LR1[Start late rollback]
end
```
## Multi-Version Problem
During the process we have a window where:

- All write and some read transactions are handled by `SOURCE`, running version `x` (PG12)
- Some read transactions are handled by `TARGET`, running version `y` (PG14)
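
If the application needs to detect which major version a given connection is talking to, one option is to inspect `server_version_num` per connection, for example:

```sql
-- server_version_num is e.g. 120011 for PG12 or 140005 for PG14;
-- integer division by 10000 yields the major version.
SELECT current_setting('server_version_num')::int / 10000 AS major_version;
```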
As highlighted by @DylanGriffith here db-migration!412 (comment 1375987600), it is possible that our application is not able to handle a situation where not all transactions are processed by the same PostgreSQL major version.
## Redirection Problem
As pointed out here gitlab-org/gitlab#364370 (closed), moving read-only queries between standbys may take significant time and behaves erratically.
## Solution
This issue is intended to:
- Highlight the problem
- Determine if this is indeed a blocker for the upcoming upgrades
- Find a viable solution if needed