Postgres upgrade to PG14 – plan proposal: testing, benchmarks, and deployments
We're planning to upgrade the Main, CI, and Registry clusters to Postgres 13 or 14. Here I propose the plan for testing – including production tests – and the deployment actions.
We aim to use logical replication, pg_upgrade, physical2logical conversion, and – optionally (to be decided after additional gprd tests) – pgBouncer's PAUSE/RESUME to achieve a zero-downtime upgrade. Without pgBouncer's PAUSE/RESUME, we will get a near-zero-downtime upgrade (brief downtime, <1 minute).
Currently, the whole procedure implementing all the steps mentioned above is automated in Ansible and tested in lower environments ("benchmarking"). A detailed description of the implementation is present in this MR.
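For illustration, the main building blocks named above can be sketched in shell. This is a minimal sketch under stated assumptions, not the playbook code: all hostnames, data directories, database names, and the publication/subscription/slot names are hypothetical, and the actual ordering and safety checks live in the Ansible implementation in the MR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only - hostnames, paths, and object names are assumptions.
set -euo pipefail

OLD_PRIMARY="old-primary.db.internal"   # hypothetical PG12 primary
NEW_NODE="new-node.db.internal"         # hypothetical node being upgraded to PG14

# pg_upgrade with hard links (--link) avoids copying data files; --check
# performs a dry run first. Both sets of binaries must be installed.
run_pg_upgrade() {
  /usr/lib/postgresql/14/bin/pg_upgrade \
    --old-bindir=/usr/lib/postgresql/12/bin \
    --new-bindir=/usr/lib/postgresql/14/bin \
    --old-datadir=/var/lib/postgresql/12/main \
    --new-datadir=/var/lib/postgresql/14/main \
    --link --check
}

# physical2logical: instead of the slow initial data copy, a physical replica
# is converted into a logical subscriber by creating a subscription that
# reuses a pre-created slot and skips copy_data.
setup_logical_replication() {
  psql -h "$OLD_PRIMARY" -d gitlabhq_production \
    -c "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;"
  psql -h "$NEW_NODE" -d gitlabhq_production \
    -c "CREATE SUBSCRIPTION upgrade_sub
          CONNECTION 'host=${OLD_PRIMARY} dbname=gitlabhq_production'
          PUBLICATION upgrade_pub
          WITH (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');"
}
```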
Questions that need to be finalized sooner:

- Should we upgrade straight to Postgres 14, skipping 13? Yes!
  - Cons: additional work will be needed to cover changes for monitoring, Chef, provisioning, backups, performance benchmarking to compare with PG12, etc. – although there should be no significant additional work at this stage
  - Pros: reduced ops overhead (one fewer major upgrade for both clusters), new features, and better performance, especially in the area of btree bloat growth rates (btree deduplication) – "Numerous performance improvements have been made for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming" (release notes) – beneficial for handling the workload growth
- Should we combine the Postgres 13 upgrade with the upgrade to n2 or n2d (discussed here and benchmarked here and here)? The hardware upgrade is currently being tested; if beneficial, we will execute it before the PostgreSQL upgrade
Proposed plan:
- physical2logical component developed and tested
  - tests in lower environments (synthetic and "benchmarking")
  - tests in gstg: production#7969 (closed)
  - gprd CI cluster: production#8187 (closed)
  - gprd Main cluster, 1st run: production#8217 (closed)
  - gprd Main cluster, 2nd run: production#8290 (closed)
- Test logical + pg_upgrade + switchover (with PAUSE/RESUME) in "benchmarking" – test all steps, fix and improve playbooks
  - test 1, w/o load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17383
  - test 2, w/o load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17383#note_1299400178
  - test 3, with synthetic load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17384#note_1309684455
  - test 4, with synthetic load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17384#note_1325967647
  - test 5, with synthetic load – a complete test, including validating the Chef role and MR, similar to the actual upgrade in higher environments (gstg, gprd): db-migration!358 (comment 1314466898)
  - test 6? Fix issues identified in the earlier test(s)
  - Questions for testing:
    1. Are we going to validate PAUSE/RESUME ("Plan A") only in the "benchmarking" env, or also "Plan B" – the "forced", near-zero-downtime mode?
    2. Are we also planning to simulate and/or validate DR Archive and Delayed in "benchmarking"? db-migration!358 (comment 1296050291)
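To make "Plan A" concrete, the PAUSE/RESUME switchover step can be sketched as below. This is a sketch only: the pgBouncer host, port, and user are assumptions, and the way the backend is repointed (config edit, Consul, etc.) is left abstract.

```shell
#!/usr/bin/env bash
# "Plan A" sketch: zero-downtime switchover via pgBouncer's admin console.
# Host/port/user are hypothetical; connect to the special "pgbouncer" database.
set -euo pipefail

PGB="psql -h pgbouncer.db.internal -p 6432 -U pgbouncer -d pgbouncer"

zero_downtime_switchover() {
  $PGB -c "PAUSE;"    # stop handing out connections; in-flight queries drain
  # ...repoint pgBouncer at the new PG14 primary here (config/Consul change)...
  $PGB -c "RELOAD;"   # pick up the new backend definition
  $PGB -c "RESUME;"   # release queued client queries - clients never saw an error
}
```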
- PG14 performance benchmarking vs PG12 with synthetic load
- PG14 performance benchmarking vs PG12 with JMeter (a subset of the production load)
- dr-archive and dr-delayed PG upgrade using the playbook, plus validations – especially because we have both postgres and gitlab-rails, per the issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15827
- CR template (issue to develop the template)
- gprd testing of the whole procedure excluding switchover (goals: study actual lags, step durations, etc., and have initial rehearsals)
  - test 1 in gprd-main, collect feedback and make adjustments – 2023-04-15 16:00Z (Saturday)
  - test 2 in gprd-main
  - optionally, test 3 in gprd-main
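For the "study actual lags" goal, one simple observation method is to compare the primary's current WAL position against what each logical slot's consumer has confirmed. A sketch, assuming a hypothetical host name:

```shell
#!/usr/bin/env bash
# Sketch: report logical replication slot lag on the (hypothetical) primary.
set -euo pipefail

PRIMARY="primary.db.internal"   # hypothetical gprd primary

check_slot_lag() {
  psql -h "$PRIMARY" -Atc \
    "SELECT slot_name,
            pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                           confirmed_flush_lsn)) AS lag
       FROM pg_replication_slots
      WHERE slot_type = 'logical';"
}
```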
- Upgrade [GSTG] using the full procedure – {+ target time: 2023-04-26, early Q2 +} production#8448 (closed)
  - [GSTG] Registry
  - [GSTG] CI
  - [GSTG] Main
- Ensure that all infrastructure components work well in [GSTG] (backups, monitoring, etc.) and develop fixes if needed
- Performance testing of the main cluster in gstg, to find possible regressions
- Performance testing of the ci cluster in gstg, to find possible regressions
- Performance testing in gprd – add just one (1) standby node of the new major version to the existing fleet and have a face-to-face comparison of PG12 vs PG14 for read-only load
- Implement and test the "forced" mode (near-zero-downtime – switchover w/o PAUSE/RESUME)
  - implementation – by end of March
  - tests in lower environments
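A sketch of what the "forced" mode amounts to, assuming the same hypothetical pgBouncer console and database name as above: instead of PAUSE/RESUME, the backend is repointed and existing connections are dropped, so clients see errors for a brief window while they reconnect.

```shell
#!/usr/bin/env bash
# "Plan B" sketch: near-zero-downtime switchover without PAUSE/RESUME.
# Host/port/user and the database name are assumptions.
set -euo pipefail

PGB="psql -h pgbouncer.db.internal -p 6432 -U pgbouncer -d pgbouncer"

forced_switchover() {
  # ...repoint pgBouncer at the new PG14 primary first (config/Consul change)...
  $PGB -c "RELOAD;"                     # new server connections go to PG14
  $PGB -c "KILL gitlabhq_production;"   # drop existing client/server connections
  $PGB -c "RESUME gitlabhq_production;" # re-enable the database after KILL
}
```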
- Get review and approvals from the Database Group in Engineering (review from the application point of view, with data collected during gprd tests)
- Finalize the Ansible playbooks `upgrade.yml` and `switchover.yml` (2 modes) and prepare documentation (runbooks)
  - `upgrade.yml` code
  - `switchover.yml` code
  - runbook for `upgrade.yml`
  - runbook for `switchover.yml`
- Final rehearsals (at this point, only critical bugfixes are allowed)
  - run 1 (tbd which mode and which environment; if gprd, then it's without switchover)
  - run 2
  - run 3
- Upgrade Registry using the full procedure
  - gprd
  - including gstg registry-dr-archive and registry-dr-delayed
- Deployment for gprd-main
  - The date is finalized, announcements are planned/made
  - DDL silence period coordinated
  - Index maintenance disabled for the deployment day
  - Deployment (tbd: detailed plan)
  - Should we stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions? (tbd: stop or not – review and decide based on the earlier test results)
  - Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
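If the decision is to terminate Sidekiq's DB sessions rather than stop Sidekiq, the queries might look like the sketch below. The `application_name` pattern is an assumption about how Sidekiq sessions are identifiable; the real filter needs to be confirmed against gprd before use.

```shell
#!/usr/bin/env bash
# Sketch: inspect and forcefully terminate Sidekiq DB sessions.
# Host name and the application_name pattern are assumptions.
set -euo pipefail

PRIMARY="primary.db.internal"   # hypothetical gprd primary

list_sidekiq_sessions() {
  psql -h "$PRIMARY" -Atc \
    "SELECT pid, state, application_name
       FROM pg_stat_activity
      WHERE application_name ILIKE 'sidekiq%';"
}

terminate_sidekiq_sessions() {
  psql -h "$PRIMARY" -c \
    "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
      WHERE application_name ILIKE 'sidekiq%'
        AND pid <> pg_backend_pid();"
}
```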
- Deployment for gprd-CI
  - The date is finalized, announcements are planned/made
  - DDL silence period coordinated
  - Index maintenance disabled for the deployment day
  - Deployment (tbd: detailed plan)
  - Should we stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions? (tbd: stop or not – review and decide based on the earlier test results)
  - Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
This is a draft; any comments are welcome.