pg-upgrade-logical improvements (2024, part 1)
Issue: #24
What's new
-
Pre-upgrade tasks (avoid manual actions)
- Check WALG_GS_PREFIX and clear WAL-G backups and WALs in the GCS bucket
- First, check that the WALG_GS_PREFIX on the target and source clusters are different.
- if the same, the playbook will stop with an error
- Else, clear WAL-G backups and WALs in the GCS bucket
- if
clear_target_cluster_walg_backups
is 'true' (default is 'true')
- if
- First, check that the WALG_GS_PREFIX on the target and source clusters are different.
- Prepare the parameters for PostgreSQL (compatibility with Postgres 16)
- Check the removed or renamed parameters starting from version 14 to 16 version.
- Check WALG_GS_PREFIX and clear WAL-G backups and WALs in the GCS bucket
-
Post-Upgrade tasks (avoid manual actions)
- Monitor the locks and terminate the backend blocking the 'ANALYZE' query
- To prevent blocking, we execute the '
pg_terminator
' script during statistics collection. - Previously, this check was performed manually, now this task allows us to avoid this manual action.
- To prevent blocking, we execute the '
- Execute WAL-G backup (in the background) after the upgrade
- Update PostgreSQL data directory path in
/opt/wal-g/bin/backup.sh
script- Note: if the script is specified in the
backup_command
variable (by default, it is specified)
- Note: if the script is specified in the
- Run WAL-G backup script in the background (if
backup_after_upgrade
is 'true' (default 'true'))- Note: Ansible only initiates the backup script execution (see
backup_command
variable) and does not wait for it to finish (because it is a long-running job). - It is recommended to check the backup status after N hours (depending on the size of the database) using the command:
/usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g backup-list
- Note: Ansible only initiates the backup script execution (see
- Update PostgreSQL data directory path in
- Validate indexes using
pg_amcheck
after the upgrade - background job (ifpg_amcheck
is 'true' (default 'false'))- Note: pg_amcheck runs on the Replica server by default. If the '--parent-check' option is added, then it will be launched on the Primary server.
- Note: Ansible only initiates the pg_amcheck execution and does not wait for it to finish (because it is a long-running job).
- It is recommended to check the pg_amcheck result after N hours (depending on the size of the database) in the log file
/tmp/pg_amcheck.log
('pg_amcheck_log_path' variable) on the Target cluster leader.
- Delete old data directory and old Postgres packages after the upgrade
- Only after a successful upgrade. Before deleting the data directory, make sure that the new directory is now used to prevent the deletion of the current data directory that is being used.
- Monitor the locks and terminate the backend blocking the 'ANALYZE' query
-
Switchover improvements
- Multi-slot for reverse logical replication
- New variable:
pg_publication_count_reverse
, default4
- New variable:
- Split tables into groups at the SQL level instead of using a loop in ansible, to speed up the process.
- Note: Previously, it took a lot of time with a large number of tables. If
pg_publication_count
orpg_publication_count_reverse
more then '1'
- Note: Previously, it took a lot of time with a large number of tables. If
- Split the switchover into separate playbooks to avoid potential mistakes
- Playbooks:
switchover_replica.yml
,switchover_leader.yml
- Playbooks:
- Avoid the risks of making unwanted changes to the old database after the switchover R/W traffic to the new cluster
- replace
pg_hba.conf
file to allow only local connections (as superuser only), and replication connections to keep physical and logical replication in an active state.
- replace
- Pause Patroni on the Source Cluster (old database) after the switchover R/W traffic to the new cluster
- disable auto-failover (if
enable_reverse_logical_replication
is 'true')
- disable auto-failover (if
- Re-check the lag of logical replication before the switchover
- Attempt to wait for the window without logical replication lag before switchover. Max wait time: 5 minutes.
- Note: if the lag is more than 0 bytes, but less than the allowed one, continue the switchover. Else, stop if the lag is high (more than
max_replication_lag_bytes
(if force_mode: false) orforce_mode_max_replication_lag_bytes
(if force_mode: true))
- Avoid the risk of a long PAUSE of the PgBouncer pools
- If, after reaching the timeout (60 seconds) the logical replication lag is higher than 0 bytes, RESUME will be executed and the playbook will stop with an error and a request to try again later.
- Multi-slot for reverse logical replication
Edited by Vitaliy Kukharik