pg-upgrade-logical improvements (2024, part 1) (!512) · Merge requests · GitLab.com / GitLab Infrastructure Team / db-migration

Vitaliy Kukharik requested to merge pg-upgrade-logical-improvements into master Mar 18, 2024

Issue: #24

What's new

Pre-upgrade tasks (avoid manual actions)
1. Check WALG_GS_PREFIX and clear WAL-G backups and WALs in the GCS bucket
  - First, check that the WALG_GS_PREFIX on the target and source clusters are different.
    - if the same, the playbook will stop with an error
  - Else, clear WAL-G backups and WALs in the GCS bucket
    - if clear_target_cluster_walg_backups is 'true' (default is 'true')
2. Prepare the parameters for PostgreSQL (compatibility with Postgres 16)
  - Check the removed or renamed parameters starting from version 14 to 16 version.
Post-Upgrade tasks (avoid manual actions)
1. Monitor the locks and terminate the backend blocking the 'ANALYZE' query
  - To prevent blocking, we execute the 'pg_terminator' script during statistics collection.
  - Previously, this check was performed manually, now this task allows us to avoid this manual action.
2. Execute WAL-G backup (in the background) after the upgrade
  - Update PostgreSQL data directory path in /opt/wal-g/bin/backup.sh script
    - Note: if the script is specified in the backup_command variable (by default, it is specified)
  - Run WAL-G backup script in the background (if backup_after_upgrade is 'true' (default 'true'))
    - Note: Ansible only initiates the backup script execution (see backup_command variable) and does not wait for it to finish (because it is a long-running job).
    - It is recommended to check the backup status after N hours (depending on the size of the database) using the command: /usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g backup-list
3. Validate indexes using pg_amcheck after the upgrade - background job (if pg_amcheck is 'true' (default 'false'))
  - Note: pg_amcheck runs on the Replica server by default. If the '--parent-check' option is added, then it will be launched on the Primary server.
  - Note: Ansible only initiates the pg_amcheck execution and does not wait for it to finish (because it is a long-running job).
  - It is recommended to check the pg_amcheck result after N hours (depending on the size of the database) in the log file /tmp/pg_amcheck.log ('pg_amcheck_log_path' variable) on the Target cluster leader.
4. Delete old data directory and old Postgres packages after the upgrade
  - Only after a successful upgrade. Before deleting the data directory, make sure that the new directory is now used to prevent the deletion of the current data directory that is being used.
Switchover improvements
1. Multi-slot for reverse logical replication
  - New variable: pg_publication_count_reverse, default 4
2. Split tables into groups at the SQL level instead of using a loop in ansible, to speed up the process.
  - Note: Previously, it took a lot of time with a large number of tables. If pg_publication_count or pg_publication_count_reverse more then '1'
3. Split the switchover into separate playbooks to avoid potential mistakes
  - Playbooks: switchover_replica.yml, switchover_leader.yml
4. Avoid the risks of making unwanted changes to the old database after the switchover R/W traffic to the new cluster
  - replace pg_hba.conf file to allow only local connections (as superuser only), and replication connections to keep physical and logical replication in an active state.
5. Pause Patroni on the Source Cluster (old database) after the switchover R/W traffic to the new cluster
  - disable auto-failover (if enable_reverse_logical_replication is 'true')
6. Re-check the lag of logical replication before the switchover
  - Attempt to wait for the window without logical replication lag before switchover. Max wait time: 5 minutes.
  - Note: if the lag is more than 0 bytes, but less than the allowed one, continue the switchover. Else, stop if the lag is high (more than max_replication_lag_bytes (if force_mode: false) or force_mode_max_replication_lag_bytes (if force_mode: true))
7. Avoid the risk of a long PAUSE of the PgBouncer pools
  - If, after reaching the timeout (60 seconds) the logical replication lag is higher than 0 bytes, RESUME will be executed and the playbook will stop with an error and a request to try again later.

Edited Apr 09, 2024 by Vitaliy Kukharik

pg-upgrade-logical improvements (2024, part 1)

What's new

Merge request reports