Postgres upgrade to PG14 – plan proposal: testing, benchmarks, and deployments
We're planning to upgrade the Main, CI, and Registry clusters to Postgres 13 or 14. Here I propose the plan for testing – including production tests – and the deployment actions.
We aim to use logical replication, pg_upgrade, physical2logical conversion, and – optionally (to be decided after additional gprd tests) – pgBouncer's PAUSE/RESUME to achieve a zero-downtime upgrade. Without pgBouncer's PAUSE/RESUME, we will get a near-zero-downtime upgrade (brief downtime, <1 minute).
Currently, the whole procedure implementing all the steps mentioned above is automated in Ansible and tested in lower environments ("benchmarking"). A detailed description of the implementation is present in this MR.
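For illustration, the main building blocks named above can be sketched in shell. This is a minimal sketch under stated assumptions, not the playbook code: all hostnames, data directories, database names, and the publication/subscription/slot names are hypothetical, and the actual ordering and safety checks live in the Ansible implementation in the MR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only - hostnames, paths, and object names are assumptions.
set -euo pipefail

OLD_PRIMARY="old-primary.db.internal"   # hypothetical PG12 primary
NEW_NODE="new-node.db.internal"         # hypothetical node being upgraded to PG14

# pg_upgrade with hard links (--link) avoids copying data files; --check
# performs a dry run first. Both sets of binaries must be installed.
run_pg_upgrade() {
  /usr/lib/postgresql/14/bin/pg_upgrade \
    --old-bindir=/usr/lib/postgresql/12/bin \
    --new-bindir=/usr/lib/postgresql/14/bin \
    --old-datadir=/var/lib/postgresql/12/main \
    --new-datadir=/var/lib/postgresql/14/main \
    --link --check
}

# physical2logical: instead of the slow initial data copy, a physical replica
# is converted into a logical subscriber by creating a subscription that
# reuses a pre-created slot and skips copy_data.
setup_logical_replication() {
  psql -h "$OLD_PRIMARY" -d gitlabhq_production \
    -c "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;"
  psql -h "$NEW_NODE" -d gitlabhq_production \
    -c "CREATE SUBSCRIPTION upgrade_sub
          CONNECTION 'host=${OLD_PRIMARY} dbname=gitlabhq_production'
          PUBLICATION upgrade_pub
          WITH (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');"
}
```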
Questions that need to be finalized sooner:

- Should we upgrade straight to Postgres 14, skipping 13? Yes!
  - Cons: additional work will be needed to cover changes for monitoring, Chef, provisioning, backups, performance benchmarking to compare with PG12, etc. – although there should be no significant additional work at this stage
  - Pros: reduced ops overhead (one fewer major upgrade for both clusters), new features, and better performance, especially in the area of btree bloat growth rates (btree deduplication) – "Numerous performance improvements have been made for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming" (release notes) – beneficial for handling the workload growth
- Should we combine the Postgres 13 upgrade with the upgrade to n2 or n2d (discussed here and benchmarked here and here)? The hardware upgrade is currently being tested; if beneficial, we will execute it before the PostgreSQL upgrade
Proposed plan:
- physical2logical component developed and tested
  - tests in lower environments (synthetic and "benchmarking")
  - tests in gstg: production#7969 (closed)
  - gprd CI cluster: production#8187 (closed)
  - gprd Main cluster, 1st run: production#8217 (closed)
  - gprd Main cluster, 2nd run: production#8290 (closed)
- Test logical + pg_upgrade + switchover (with PAUSE/RESUME) in "benchmarking" – test all steps, fix and improve playbooks
  - test 1, w/o load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17383
  - test 2, w/o load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17383#note_1299400178
  - test 3, with synthetic load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17384#note_1309684455
  - test 4, with synthetic load: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17384#note_1325967647
  - test 5, with synthetic load – a complete test, including validating the Chef role and MR, similar to the actual upgrade in higher environments (gstg, gprd): db-migration!358 (comment 1314466898)
  - test 6? Fix issues identified in the earlier test(s)
  - Questions for testing:
    1. Are we going to validate PAUSE/RESUME ("Plan A") only in the "benchmarking" env, or also "Plan B" – the "forced", near-zero-downtime mode?
    2. Are we also planning to simulate and/or validate DR Archive and Delayed in "benchmarking"? db-migration!358 (comment 1296050291)
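To make "Plan A" concrete, the PAUSE/RESUME switchover step can be sketched as below. This is a sketch only: the pgBouncer host, port, and user are assumptions, and the way the backend is repointed (config edit, Consul, etc.) is left abstract.

```shell
#!/usr/bin/env bash
# "Plan A" sketch: zero-downtime switchover via pgBouncer's admin console.
# Host/port/user are hypothetical; connect to the special "pgbouncer" database.
set -euo pipefail

PGB="psql -h pgbouncer.db.internal -p 6432 -U pgbouncer -d pgbouncer"

zero_downtime_switchover() {
  $PGB -c "PAUSE;"    # stop handing out connections; in-flight queries drain
  # ...repoint pgBouncer at the new PG14 primary here (config/Consul change)...
  $PGB -c "RELOAD;"   # pick up the new backend definition
  $PGB -c "RESUME;"   # release queued client queries - clients never saw an error
}
```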
- PG14 performance benchmarking vs PG12 with synthetic load
- PG14 performance benchmarking vs PG12 with JMeter (a subset of the production load)
- dr-archive and dr-delayed PG upgrade using the playbook, plus validations – especially because we have both postgres and gitlab-rails, per the issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15827
- CR template (issue to develop the template)
- gprd testing of the whole procedure excluding switchover (goals: study actual lags, step durations, etc., and have initial rehearsals)
  - test 1 in gprd-main, collect feedback and make adjustments – 2023-04-15 16:00Z (Saturday)
  - test 2 in gprd-main
  - optionally, test 3 in gprd-main
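For the "study actual lags" goal, one simple observation method is to compare the primary's current WAL position against what each logical slot's consumer has confirmed. A sketch, assuming a hypothetical host name:

```shell
#!/usr/bin/env bash
# Sketch: report logical replication slot lag on the (hypothetical) primary.
set -euo pipefail

PRIMARY="primary.db.internal"   # hypothetical gprd primary

check_slot_lag() {
  psql -h "$PRIMARY" -Atc \
    "SELECT slot_name,
            pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                           confirmed_flush_lsn)) AS lag
       FROM pg_replication_slots
      WHERE slot_type = 'logical';"
}
```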
- Upgrade [GSTG] using the full procedure – {+ target time: 2023-04-26, early Q2 +} production#8448 (closed)
  - [GSTG] Registry
  - [GSTG] CI
  - [GSTG] Main
- Ensure that all infrastructure components work well in [GSTG] (backups, monitoring, etc.) and develop fixes if needed
- Performance testing of the main cluster in gstg, to find possible regressions
- Performance testing of the ci cluster in gstg, to find possible regressions
- Performance testing in gprd – add just one (1) standby node of the new major version to the existing fleet and have a face-to-face comparison of PG12 vs PG14 for read-only load
- Implement and test the "forced" mode (near-zero-downtime – switchover w/o PAUSE/RESUME)
  - implementation – by end of March
  - tests in lower environments
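A sketch of what the "forced" mode amounts to, assuming the same hypothetical pgBouncer console and database name as above: instead of PAUSE/RESUME, the backend is repointed and existing connections are dropped, so clients see errors for a brief window while they reconnect.

```shell
#!/usr/bin/env bash
# "Plan B" sketch: near-zero-downtime switchover without PAUSE/RESUME.
# Host/port/user and the database name are assumptions.
set -euo pipefail

PGB="psql -h pgbouncer.db.internal -p 6432 -U pgbouncer -d pgbouncer"

forced_switchover() {
  # ...repoint pgBouncer at the new PG14 primary first (config/Consul change)...
  $PGB -c "RELOAD;"                     # new server connections go to PG14
  $PGB -c "KILL gitlabhq_production;"   # drop existing client/server connections
  $PGB -c "RESUME gitlabhq_production;" # re-enable the database after KILL
}
```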
- Get review and approvals from the Database Group in Engineering (review from the application point of view, with data collected during gprd tests)
- Finalize the Ansible playbooks `upgrade.yml` and `switchover.yml` (2 modes) and prepare documentation (runbooks)
  - `upgrade.yml` code
  - `switchover.yml` code
  - runbook for `upgrade.yml`
  - runbook for `switchover.yml`
- Final rehearsals (at this point, only critical bugfixes are allowed)
  - run 1 (tbd which mode and which environment; if gprd, then it's without switchover)
  - run 2
  - run 3
- Upgrade Registry using the full procedure
  - gprd
  - including gstg registry-dr-archive and registry-dr-delayed
- Deployment for gprd-main
  - The date is finalized, announcements are planned/made
  - DDL silence period coordinated
  - Index maintenance disabled for the deployment day
  - Deployment (tbd: detailed plan)
  - Should we stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions? (tbd: stop or not – review and decide based on the earlier test results)
  - Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
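If the decision is to terminate Sidekiq's DB sessions rather than stop Sidekiq, the queries might look like the sketch below. The `application_name` pattern is an assumption about how Sidekiq sessions are identifiable; the real filter needs to be confirmed against gprd before use.

```shell
#!/usr/bin/env bash
# Sketch: inspect and forcefully terminate Sidekiq DB sessions.
# Host name and the application_name pattern are assumptions.
set -euo pipefail

PRIMARY="primary.db.internal"   # hypothetical gprd primary

list_sidekiq_sessions() {
  psql -h "$PRIMARY" -Atc \
    "SELECT pid, state, application_name
       FROM pg_stat_activity
      WHERE application_name ILIKE 'sidekiq%';"
}

terminate_sidekiq_sessions() {
  psql -h "$PRIMARY" -c \
    "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
      WHERE application_name ILIKE 'sidekiq%'
        AND pid <> pg_backend_pid();"
}
```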
- Deployment for gprd-CI
  - The date is finalized, announcements are planned/made
  - DDL silence period coordinated
  - Index maintenance disabled for the deployment day
  - Deployment (tbd: detailed plan)
  - Should we stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions? (tbd: stop or not – review and decide based on the earlier test results)
  - Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
This is a draft; any comments are welcome.