
Postgres upgrade to PG14 – plan proposal: testing, benchmarks, and deployments

We're planning to upgrade the Main, CI, and Registry clusters to Postgres 13 or 14. Here I propose a plan for testing – including production tests – and for the deployment actions.

We aim to use logical replication, pg_upgrade, physical2logical conversion, and – optionally (to be decided after additional gprd tests) – PgBouncer's PAUSE/RESUME to achieve a zero-downtime upgrade. Without PgBouncer's PAUSE/RESUME, we will get a near-zero-downtime upgrade (brief downtime of <1 minute).
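
As a minimal sketch of the optional PAUSE/RESUME step (assuming PgBouncer listens on port 6432 and its admin console is reachable as the `pgbouncer` database; host names are placeholders, and the actual switchover command comes from our playbooks):

```shell
# 1. Pause all pools: PgBouncer lets active queries/transactions finish,
#    then holds new ones in a queue instead of rejecting them.
psql -h pgbouncer-host -p 6432 -U pgbouncer -d pgbouncer -c 'PAUSE;'

# 2. The switchover to the new-version primary happens here
#    (implementation-specific; driven by the Ansible playbooks).

# 3. Resume: queued transactions continue against the new primary, so
#    clients observe a latency spike rather than connection errors.
psql -h pgbouncer-host -p 6432 -U pgbouncer -d pgbouncer -c 'RESUME;'
```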

Currently, the whole procedure implementing all of the steps mentioned above is automated in Ansible and tested in lower environments ("benchmarking"). A detailed description of the implementation is available in this MR.
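
For illustration, running the automation looks roughly like this (the inventory path is a placeholder; see the MR for the actual layout and variables):

```shell
ansible-playbook -i inventories/benchmarking.ini upgrade.yml
ansible-playbook -i inventories/benchmarking.ini switchover.yml
```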

Questions that need to be finalized sooner:

  • Should we upgrade straight to Postgres 14, skipping 13? Yes!
    • Cons: additional work will be needed to cover changes to monitoring, Chef, provisioning, backups, performance benchmarking to compare with PG12, etc. – although there should be no significant additional work at this stage
    • Pros: reduced ops overhead (one fewer major upgrade for both clusters), new features, and better performance, especially in the area of btree bloat growth rates thanks to btree deduplication (see the example after this list) – "Numerous performance improvements have been made for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming" (release notes) – beneficial for handling the workload growth
  • Should we combine the Postgres upgrade with the upgrade to n2 or n2d (discussed here and benchmarked here and here)?
  • The hardware upgrade is currently being tested; if beneficial, we will execute it before the PostgreSQL upgrade
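
As a quick illustration of the btree deduplication point above: deduplication is on by default in PG13+, and the `deduplicate_items` storage parameter lets us build a PG12-like index for a size comparison. A throwaway sketch (all object names are examples):

```shell
psql <<'SQL'
-- Synthetic low-cardinality column - the case where deduplication helps most.
CREATE TABLE dedup_demo AS
  SELECT i % 1000 AS v FROM generate_series(1, 1000000) i;

CREATE INDEX dedup_on  ON dedup_demo (v);                                 -- PG13+ default
CREATE INDEX dedup_off ON dedup_demo (v) WITH (deduplicate_items = off);  -- PG12-like

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('dedup_on', 'dedup_off');
SQL
```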

Proposed plan:

  • physical2logical component developed and tested (see the conversion sketch at the end of this plan)

  • Test logical replication + pg_upgrade + switchover (with PAUSE/RESUME) in "benchmarking" – test all steps, fix and improve the playbooks

  • PG14 Performance benchmarking vs PG12 with synthetic load

  • PG14 Performance benchmarking vs PG12 with JMeter (subset of the production load)

  • dr-archive and dr-delayed PG upgrade using the playbook, with validations – especially important because these nodes run both postgres and gitlab-rails, per https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15827

  • CR template (issue to develop the template)

  • Issue for staging

  • gprd testing of the whole procedure excluding switchover (goals: study actual lags, step durations, etc., and have initial rehearsals)

    • test 1 in gprd-main, collect feedback, and make adjustments – 2023-04-15 16:00Z (Saturday)
    • test 2 in gprd-main
    • optionally, test 3 in gprd-main
  • Upgrade [GSTG] using the full procedure – {+ target time: 2023-04-26, early Q2 +} production#8448 (closed)

    • [GSTG] Registry
    • [GSTG] CI
    • [GSTG] Main
  • Ensure that all infrastructure components work well in [GSTG] (backups, monitoring, etc.) and develop fixes if needed

  • Performance testing the main cluster in gstg, to find possible regressions

  • Performance testing the ci cluster in gstg, to find possible regressions

  • Performance testing in gprd – add just one (1) standby node of the new major version to the existing fleet and get a side-by-side comparison of PG12 vs PG14 under read-only load

  • Implement and test the "forced" mode (near-zero-downtime – switchover w/o PAUSE/RESUME; see the session-termination sketch at the end of this plan)

    • implementation – by end of March
    • tests in lower environments
  • Get review and approval from the Database Group in Engineering (review from the application point of view, with data collected during the gprd tests)

  • Finalize Ansible playbooks upgrade.yml and switchover.yml (2 modes) and prepare documentation (runbooks)

    • upgrade.yml code
    • switchover.yml code
    • runbook for upgrade.yml
    • runbook for switchover.yml
  • Final rehearsals (at this point, only critical bugfixes are allowed)

    • run 1 (tbd which mode and which environment; if gprd, then it's without switchover)
    • run 2
    • run 3
  • Upgrade registry using the full procedure

    • gprd
    • including gstg registry-dr-archive and registry-dr-delayed
  • Deployment for gprd-main

    • The date is finalized, announcements are planned/made
    • DDL silence period coordinated
    • Index maintenance disabled for the deployment day
    • Deployment (tbd: detailed plan)
    • Decide whether to stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions (tbd: stop or not – review and decide based on the earlier test results)
    • Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
  • Deployment for gprd-CI

    • The date is finalized, announcements are planned/made
    • DDL silence period coordinated
    • Index maintenance disabled for the deployment day
    • Deployment (tbd: detailed plan)
    • Decide whether to stop Sidekiq for a brief period of time or forcefully terminate the Sidekiq DB sessions (tbd: stop or not – review and decide based on the earlier test results)
    • Post-deployment actions (tbd; to be included: backup-push, re-init of delayed and archive replicas, re-init of DLE)
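
For reference, a minimal sketch of the physical2logical conversion referenced in the plan, under the usual assumptions (all host/object names are placeholders; the real playbooks interleave these steps with pg_upgrade). The idea is to reuse an already-consistent physical replica and skip logical replication's initial data copy:

```shell
# 1. On the old primary: create a publication and a logical slot, and capture
#    the LSN at which the slot becomes consistent.
psql -h old-primary -d gitlabhq_production \
  -c "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;" \
  -c "SELECT lsn FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');"

# 2. On a physical standby: replay exactly up to that LSN, then promote
#    (postgresql.conf):
#      recovery_target_lsn    = '<lsn from step 1>'
#      recovery_target_action = 'promote'

# 3. On the promoted (and then pg_upgrade'd) node: attach to the existing slot
#    without the initial data copy - the node is already consistent at that LSN.
psql -d gitlabhq_production <<'SQL'
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=gitlabhq_production'
  PUBLICATION upgrade_pub
  WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
SQL
```

And a sketch of what the "forced" mode could do instead of PAUSE/RESUME: terminate the remaining sessions right before switchover and let clients reconnect to the new primary, accepting the brief (<1 minute) downtime; exact filtering (e.g. only Sidekiq sessions) is tbd:

```shell
# Hypothetical forced-mode step: terminate remaining client backends so the
# switchover can proceed; clients then reconnect through PgBouncer.
psql -d gitlabhq_production -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE backend_type = 'client backend'
    AND pid <> pg_backend_pid();"
```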

This is a draft; any comments are welcome.
