add pg-upgrade-logical (!358) · Merge requests · GitLab.com / GitLab Infrastructure Team / db-migration

What's new

Added a new playbook upgrade.yml
The "switchover.yml" and "switchover_rollback.yml" playbooks has been moved to the "pg-upgrade-logical" directory (as part of the upgrade procedure).
Now the playbook "physical_to_logical.yml" is part of "upgrade.yml" (the old playbook has been deleted).

This playbook can be used to upgrade to any version of PostgreSQL (12->13, 12->14, 12->15, or 13->14, 14->15, etc).

TODO: Add README.md with a description of the plan for each playbook.

Plan

Step 1: PRE-UPGRADE: Perform Pre-Checks

(TARGET) Make sure that is Standby leader
(ALL) Test PostgreSQL DB Access
(TARGET) Check the current version of PostgreSQL
- Stop if, an incorrect version of PostgreSQL may have been specified
  - if the pg_old_version variable does not match the current version of PostgreSQL
- Stop if, an incorrect target version of PostgreSQL may have been specified
  - if the pg_new_version variable is equal to or less than the current version of PostgreSQL (no update required)
(SOURCE) Make sure there are no tables with replica identity "nothing"
- Stop, if replica identity "nothing" is present
(SOURCE) Make sure that tables with replica identity “default” have primary key
- Stop, if replica identity "default" without PK is present
(ALL) Make sure that the password file ".pgpass" exists
- Stop, if .pgpass is not exists
(SOURCE) Make sure that the user is specified in the password file
- Stop, if the user (pg_source_replica_user variable) is not present
(SOURCE) Get the password for the user from the password file
(TARGET) Configure the password file for the user
Test the access to the source database
- Note: User from the pg_source_replica_user variable to database from the pg_source_dbname variable
(TARGET) Make sure that physical replication is active
- Stop, if there are no active replicas
- if pg_stat_replication returned 0 entries (except pg_basebackup)
(TARGET) Make sure there is no high physical replication lag
- No more than max_replication_lag_bytes
- Stop, if physical replication lag is high
(TARGET) Rsync Checks
- Create /tmp/testrsync file on Primary
- Test rsync and ssh key access
- Cleanup testrsync file
(TARGET) Stop Chef client to prevent unexpected changes during the process

Step 2: PRE-UPGRADE: Install new PostgreSQL packages

Update apt cache
Make sure that the new PostgreSQL packages are installed (install if missing)
Make sure that the rsync package are installed

Step 3: PRE-UPGRADE: Initialize new db, schema compatibility check, and pg_upgrade --check

Make sure new PostgreSQL data directory exists (create directories)
Get the current encodig and data_checksums settings
Initialize a new PostgreSQL data directory on the Primary
- with the same encoding
- with enabled checksums (if they are enabled in the current version of PostgreSQL)
Check the compatibility of the database schema with the new PostgreSQL
- using pg_dumpall --schema-only
Clear a new PostgreSQL data directory and re-initdb (after checking the scheme)
Verify the two clusters are compatible
- using pg_upgrade --check (check clusters only, don't change any data)

Step 4: Create a publication not the Source and reach `recovery_target_lsn` on the Target leader

Execute CHECKPOINT before stopping PostgreSQL
Stop Patroni service on the Cluster Replica
Stop Patroni service on the Cluster Leader
Check if old PostgreSQL is stopped
Check if new PostgreSQL is stopped
- additional checks using pg_ctl
- if running, stop PostgreSQL using pg_ctl
(SOURCE) Execute checkpoint and create a publication for logical replication
(SOURCE) Create a slot for logical replication, and save 'lsn'
(TARGET) Specify recovery parameters on the Standby Cluster Leader
- recovery_target_lsn = '{{ pg_slot_lsn }}'
- recovery_target_action = 'promote'
- recovery_target_timeline = 'latest'
(TARGET) Start PostgreSQL on the Standby Cluster Leader to reach recovery_target_lsn
(TARGET) Wait until the recovery is complete
(TARGET) Get the current PostgreSQL log file
(TARGET) Check the PostgreSQL log file
- Stop, if target LSN not reached
Print the result of checking the PostgreSQL log
(TARGET) Execute CHECKPOINT before stopping postgresql
(TARGET) Stop PostgreSQL after reaching the recovery target
(TARGET) Remove the Patroni standby_cluster parameters from the DCS

Step 5: PRE-UPGRADE: Prepare the Patroni configuration

Edit patroni.yml | update parameters: data_dir, bin_dir, config_dir
Edit patroni.yml | remove parameters: standby_cluster (if exists)
Copy the pg_hba.conf file to a new PostgreSQL (to save pg_hba rules)

Step 6: UPGRADE: Upgrade PostgreSQL

Upgrade the PostgreSQL to new version on the Primary
- using pg_upgrade --link (hard-links)
Upgrade the PostgreSQL on the Replica
- using rsync (hard-links)
  - Note: before rsync is executed, the owner of the PostgreSQL data directories changes to the user from under whom the playbook is executed, to use ssh agent forwarding. After rsync is executed, the owner of the directory returns to the user defined in the os_pg_user variable (gitlab-psql)
Remove existing cluster "target-postgres-cluster" from DCS
Start Patroni service on the Cluster Leader
- Wait for Patroni port to become open on the host
- Check Patroni is healthy on the Leader
Start Patroni service on the Cluster Replica
- Wait for Patroni port to become open on the host
- Check Patroni is healthy on the Replica

Step 7: POST-UPGRADE: Check Replication and Update extensions

Make sure that physical replication is active
- Check pg_stat_replication (count records, except pg_basebackup)
Create a table "test_replication" with 10000 rows on the Primary
Wait until the PostgreSQL replica is synchronized
- Check pg_stat_replication (count records, except pg_basebackup)
- We wait (no more than 10 minutes) until the replication lag is 0 bytes
Verify table record on Replicas
- The PostgreSQL Replication is Ok. If the number of records in the test_replication table the same as the Primary.
- Error, if the number of records does not match the Primary.
Drop a table "test_replication"
Update old extensions
- Get a list of old PostgreSQL extensions
- Update all old PostgreSQL extensions
  - excluding 'pg_repack' as it requires re-creation to update
- Recreate old PostgreSQL pg_repack extension
- if an update is required

Step 8: POST-UPGRADE: Create a subscription for logical replication

(TARGET) Create a subscription for logical replication using a previously created slot
(SOURCE) Make sure that logical replication is active
- wait until the logical replication slot is active (max wait time: 1 minute)
(SOURCE) Check the logical replication lag
Print the result of setting up logical replication
- example: "The logical replication setup is completed. Current replication lag: XX MB"

Step 9: POST-UPGRADE: Analyze a PostgreSQL database (update optimizer statistics)

Run vacuumdb to analyze
- for a database defined in the pg_target_dbname variable
- All CPU cores of the server are used

Requirements

Prepare the Patroni Standby Cluster from a backup or disk snapshot (not part of this playbook).
Make sure your ssh-agent has your identity loaded (ssh-add -L), since we use agent forwarding by default (configured in ansible.cfg), to upgrade replicas using rsync.

Example of a playbook execution

ansible-playbook -i inventory/db-benchmarking-main.yml upgrade.yml -e "pg_old_version=12 pg_new_version=13"

Edited Mar 01, 2023 by Vitaliy Kukharik

add pg-upgrade-logical

What's new

Plan

Step 1: PRE-UPGRADE: Perform Pre-Checks

Step 2: PRE-UPGRADE: Install new PostgreSQL packages

Step 3: PRE-UPGRADE: Initialize new db, schema compatibility check, and pg_upgrade --check

Step 4: Create a publication not the Source and reach recovery_target_lsn on the Target leader

Step 5: PRE-UPGRADE: Prepare the Patroni configuration

Step 6: UPGRADE: Upgrade PostgreSQL

Step 7: POST-UPGRADE: Check Replication and Update extensions

Step 8: POST-UPGRADE: Create a subscription for logical replication

Step 9: POST-UPGRADE: Analyze a PostgreSQL database (update optimizer statistics)

Requirements

Example of a playbook execution

Merge request reports

Step 4: Create a publication not the Source and reach `recovery_target_lsn` on the Target leader