[GPRD][PG14] Open ToDo's for v14 Upgrade (CI + Main)
This is the single source of truth for which tasks need to be done before the upgrade on 2023-08-12.
ToDo
CI Cluster
-
Define date of the new CI Upgrade attempt and start communication process - @kwanyangu -
Define date of the new Main Upgrade attempt and start communication process - @kwanyangu -
Review and test the fix to avoid data corruption during the upgrade playbook https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15925#note_1455364434 - @bshah11 -
Perform CI PG14 upgrade test in db-benchmarking
- Review and test the fix to avoid data corruption during the upgrade playbook https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24150 - @bshah11 -
Rebuild GPRD CI v14 Standby Cluster and perform upgrade test - production#16125 (closed) https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24205 - @bshah11 , @rhenchen.gitlab -
If you have time, you might want to perform the GPRD PG14 upgrade test one more time after finalizing all of the TODOs; just rebuild the target cluster and perform it. -
We should do a switchover test in the db-benchmarking environment -
Prepare for the actual GPRD CI PG14 upgrade and cut over to v14. We need to rebuild GPRD CI v14 Standby Cluster again - @rhenchen.gitlab -
Recreate the CR with the added changes: TODO
Non-blocking, after the CI Cluster upgrade
-
Post PG14 Upgrade tasks including rebuilding all B-tree indexes - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24136
Main Cluster
-
[GPRD][PG14] Fix pgBouncer configuration and Upgrade pgbouncer to a recent version - @rhenchen.gitlab -
Optimizing logical replication - We agreed not to introduce this at the last minute and to postpone it until the next upgrade: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23951#note_1440071610 - @NikolayS
-
Issue template updates (noted from our CI GPRD execution) - @rhenchen.gitlab - Modify the issue template to split the upgrade in Part1 and Part2
- Remove terminate of slow transactions from part 2 (it should be done just during part 1, before the upgrade playbook)
- Fix the Teleport check part, the commands aren't working and it should be done by the SRE and not DBRE
**PRODUCTION ONLY** :elephant: {+DBRE +}: **TODO**: TEST THIS: Confirm that Teleport access works properly and we're hitting the right clusters:
- Add 2x silences for blackbox restore operations in the old and new clusters for at least 24 hours
- Chatops Feature flags and tasks can be done in a single task, with a single message to @db-team
-
Create PG12 Source cluster and PG12 Target standby cluster in the db-benchmarking env using GPRD v12 Main/CI cluster disk snapshot - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24270 - @bshah11 -
Perform the Main PG14 upgrade test in db-benchmarking
- Review and test the fix to avoid data corruption during the upgrade playbook https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24277 - @bshah11
A few open questions for the team for the testing in the db-benchmarking
-
Are we planning to implement and test reverse replication for the GPRD Main upgrade? -
Tested reverse replication
-
-
Are we planning PG14 switchover playbook testing with the new pgBouncer changes introduced via https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23988? -
Performed test with latest pgBouncer changes
-
-
Are we planning to test with only a single replication slot? -
Performed test with multiple slots
-
-
-
We should do a switchover test in the db-benchmarking environment - @bshah11 -
Build GPRD Main v14 Standby Cluster. It should be ready for the weekend testing before 2023-08-19 - @alexander-sosna @rhenchen.gitlab -
Similar to the CI cluster, for the Main GPRD cluster, @NikolayS would like to collect data for query execution plan comparisons (v12 vs. v14) for the "performance test" - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24312 - @NikolayS -
Recreate PG12 Target standby cluster in the db-benchmarking env using Source v12 CI cluster's disk snapshot - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24270#note_1517996492 - @bshah11 -
CANCELLED Perform a final PG14 upgrade test in the db-benchmarking - Test #2 on 2023-08-18 with enhanced pgbench workload, switchover, reverse replication, and latest pgbouncer changes. At this stage, we finalize and freeze the Ansible playbook - @bshah11 @NikolayS -
Perform upgrade test (Test 1) with a single logical replication slot on 2023-08-19 - production#16194 (closed) - @bshah11 @NikolayS -
CANCELLED Based on the test results from Test 1, we might decide to perform another test (Test 2) with multiple logical replication slots. We need to rebuild GPRD Main v14 Standby Cluster again, and perform another test (Test 2) on 2023-08-21
-
Rebuild GPRD Main v14 Standby Cluster again on 2023-08-21 after 00:30 when Rafa starts his day - @rhenchen.gitlab @alexander-sosna -
Perform upgrade test in GPRD Main with multiple logical replication slots on 2023-08-22 after 00:30 - @rhenchen.gitlab @NikolayS
-
-
Destroy GPRD Main v14 Standby Cluster: I feel we should not pay for the Target v14 cluster hardware (eight n2-highmem-128 VMs) as we will not be using it and it is pretty expensive. 2023-08-21 after 00:30 when Rafa starts his day - @rhenchen.gitlab @alexander-sosna -
Perform an additional PG14 upgrade test in the db-benchmarking - Test #2 on 2023-08-22 with enhanced pgbench workload, switchover, reverse replication with a single slot, and latest pgbouncer changes. This is primarily to validate reverse replication with a single slot - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24304 - @bshah11 @NikolayS -
Prepare for the actual GPRD Main PG14 upgrade and cut over to v14. We need to rebuild GPRD Main v14 Standby Cluster again - @alexander-sosna @rhenchen.gitlab -
Recreate the CR with the added changes: @bshah11 -
Reduce time for service discovery to update after database replica dns change: @praba.m7n @stomlinson -
Consul configuration files required for the upgrade were missing on a patroni replica patroni-main-2004-102 - Implement fix and validate -
PG12->PG14 upgrade of gprd-main and query performance degradation - @NikolayS -
Update the PG14 issue template based on the TODOs identified during the prior PG14 upgrade attempt - production#14403 (comment 1530821685) - -
Evaluate options to consolidate multiple chatops commands during the upgrade/maintenance into just one chatops command (this will be more efficient and will help avoid human errors) - @stomlinson - gitlab-org/gitlab#417161 (closed) -
Define the date of the next Main Upgrade attempt and start the communication process - @kwanyangu -
TBD - Create PG12 Source cluster and PG12 Target standby cluster in the db-benchmarking env using GPRD v12 Main/CI cluster disk snapshot -
TBD - Final PG14 upgrade test in the db-benchmarking - Validate findings from https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24339 and also perform switchover, reverse replication, and latest pgbouncer changes. At this stage, we finalize and freeze the Ansible playbook -
Rebuild GPRD Main v14 Standby Cluster -
Recreate the CR with the latest issue template changes:
Non-blocking, after the Main Cluster upgrade
-
Post PG14 Upgrade tasks including rebuilding all B-tree indexes - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24136
Edited by Kennedy Wanyangu