[GPRD][PG14] Open TODOs for v14 Main Upgrade
This is the single source of truth for the tasks that need to be done before the upgrade on 2023-09-09.
Everyone feel free to open their own issues or MRs to track activities/changes, but please link them here so we can keep track of them.
Communication and Planning [Required for the upgrade]
DRI: @kwanyangu @bshah11
-
Define the date of the next Main Upgrade attempt and start the communication process - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/11377#note_1531667206 -
Send calendar invites for the Upgrade Part 1 and Part 2
-
Open 2 new CRs -
Upgrade CR - production#16266 (closed) -
Ops. CR - https://ops.gitlab.net/gitlab-com/gl-infra/db-migration/-/issues/66 -
Update the ops CR once the Issue Template TODOs are completed
-
-
Schedule maintenance mode to exclude upgrade window from SLA
Queries performance investigation - [Required for the upgrade]
DRI: @NikolayS and @stomlinson
-
Detailed analysis of all RO queries timing out and troubleshooting of plans in PG12&PG14 clones - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24339
Issue Template - [Required for the upgrade]
DRI @rhenchen.gitlab
-
IMPORTANT: partitioning wasn't stopped by disabling background migrations; add a step to disable partitioning at T-2 days (chatops disable FF task) -
We missed adding /chatops run feature set partition_manager_sync_partitions false to the issue template, as recommended by Simon from the DB team on July 11th
-
-
Add checks that the feature flags were disabled before Part 1 and Part 2 (e.g. /chatops run feature get) -
The start time for silencing alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed" is wrong: it should be T-14 hours, not T zero. Also check whether silencing for 29 hours is enough -
In Part 1, the pgbouncer connection check should cover only the target cluster's nodes, not all nodes, e.g. pgbouncer_pools_client_active_connections{env="gprd", fqdn=~"patroni-main-v14.*"}
-
Add a step on Part 1 to wait for the logical replication to get in sync before starting the post-upgrade checks -
Improve pg_amcheck to avoid blocking autovacuum for too long; check if we can use a statement_timeout of 30 minutes (PGOPTIONS="-c statement_timeout=30min" pg_amcheck ...) -
Fix query on Part 2 - "Double check that no pg_amcheck processes nor queries are running on the v14 Replica Nodes." -
Fix part2 - change ps -ef pg_amcheck to ps -ef | grep pg_amcheck
-
Fix part2 - Remove step to kill autovacuum workers -
Fix part2 - remove duplicate Confirm chef-client is disabled thanos link
-
Fix part2 - Check RO activity in v14 after the first switchover -
Fix part2 - Move the Reverse replication check and drop of subscriptions to the Wrap Up section -
Fix part2 - Communication at the end of switchover -
Fix rollback section - steps to drop logical replication -
Fix rollback section - there are 2 messages saying that the rollback is complete (1 is enough) -
Fix rollback section - Alerts: keep silences for the new cluster, and remove silence of the old cluster -
Fix rollback section - Add task to Re-enable chef -
Add a Silence for SidekiqServiceSidekiqExecutionErrorSLOViolationSingleShard, see: production#14403 (comment 1531040326) -
Replace the disable FFs with disallow_database_ddl_feature_flags in the issue template - db-migration!488 (merged)
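The silence timing item above (start at T-14 hours, 29 hours long) can be sanity-checked with a little date arithmetic. This is a minimal sketch, not part of the issue template: the function names are hypothetical, and the T zero hour is a placeholder, since only the date (2023-09-09) is stated in this issue.

```python
from datetime import datetime, timedelta, timezone


def silence_window(t_zero, lead_hours=14, duration_hours=29):
    """Return (start, end) of an alert silence relative to T zero."""
    start = t_zero - timedelta(hours=lead_hours)
    end = start + timedelta(hours=duration_hours)
    return start, end


def covers(window, moment):
    """True if the moment falls inside the silence window."""
    start, end = window
    return start <= moment <= end


# Placeholder T zero: the date comes from this issue, the hour is an assumption.
T_ZERO = datetime(2023, 9, 9, 0, 0, tzinfo=timezone.utc)

window = silence_window(T_ZERO)
hours_after_t_zero = (window[1] - T_ZERO) / timedelta(hours=1)
print(window[0].isoformat(), window[1].isoformat(), hours_after_t_zero)
```

With a 14 hour lead and a 29 hour duration, the silence ends 15 hours after T zero, so whether 29 hours is enough depends on how long after T zero the base backup alerts can still fire.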
Ansible Playbooks
Switchover Playbook - [Required for the upgrade]
-
When checking for consul config files, the number of files can differ between nodes, which caused this error: Aug 26 15:18:41 failed: [patroni-main-2004-102-db-gprd.c.gitlab-production.internal] (item=db-replica-9.json) => {"ansible_loop_var": "item", "changed": false, "item": "db-replica-9.json", "msg": "Path /etc/consul/conf.d/db-replica-9.json does not exist !", "rc": 257}
- db-migration!485 (merged) - Not all servers are created the same (they might have a different number of pgbouncer processes)
-
The step name: "(TARGET) Reload consul.service" should not depend on (service_file is defined and service_file.changed), because if the script failed and is re-executed the files might already be changed - db-migration!484 (merged)
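The consul config file error above came from iterating a fixed list of file names on nodes that do not all have the same files. One way to avoid that, sketched here in Python rather than Ansible, is to discover the files present on each node instead of assuming a count. This is an illustration only and not the change made in db-migration!485; the function names are hypothetical, and the db-replica*.json pattern is inferred from the file name in the error message.

```python
import tempfile
from pathlib import Path


def discover_consul_configs(conf_dir, pattern="db-replica*.json"):
    """List the config files actually present on this node, sorted by name."""
    return sorted(p.name for p in Path(conf_dir).glob(pattern))


def missing_configs(conf_dir, expected):
    """Return the expected file names that do not exist on this node."""
    present = set(discover_consul_configs(conf_dir))
    return [name for name in expected if name not in present]


# Simulate a node with only three pgbouncer-related config files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"db-replica-{i}.json").write_text("{}")
    found = discover_consul_configs(tmp)
    absent = missing_configs(tmp, ["db-replica-0.json", "db-replica-9.json"])

print(found, absent)
```

Working from the discovered list means a node with fewer pgbouncer processes is not treated as a failure, while an explicit expected list can still be compared when a specific file is required.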
Upgrade Playbook
Upgrade test in db-benchmarking environment - [Required for the upgrade]
DRI: @bshah11, @NikolayS and @vitabaks
-
Create PG12 Source cluster and PG12 Target standby cluster in the db-benchmarking env using GPRD v12 Main/CI cluster disk snapshot - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24382 -
Create pgbouncer VMs -
Final PG14 upgrade test in db-benchmarking - Validate the recent playbook changes and also perform the switchover and reverse replication with the latest pgbouncer changes. At this stage, we finalize and freeze the Ansible playbook - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24383
Infrastructure and Configuration management
DRI: @anganga @alexander-sosna
PG14 Settings - Chef roles
-
Create an MR to disable enable_memoize - MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3935
Terraform v14 cluster - [Required for the upgrade]
-
Rebuild patroni-main-v14
-
the nodes 101, 103 and 104 should be in different AZs -
the nodes 106, 107 and 108 should be marked as nofailover
-
Check if syslog was created on all nodes; if it wasn't, try restarting the VM to fix the issue
-
-
[GPRD] Rebuild delayed and archive replicas for the Main-v14 cluster -
Create CR with necessary steps (utilize learnings from the CI cluster) and update the issue template to have the correct CR in the ops.gitlab.net upgrade issue
-
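The node placement requirements in the rebuild task above (101, 103 and 104 in different AZs; 106, 107 and 108 marked as nofailover) could be verified after the rebuild with a small check like the following. This is a sketch under assumptions: the inventory dictionary, its field names, and the zone labels are hypothetical placeholders, not output from Terraform or Patroni.

```python
def check_layout(nodes, spread=("101", "103", "104"),
                 nofailover=("106", "107", "108")):
    """Return a list of human-readable problems; empty means the layout is fine."""
    problems = []

    # The spread nodes must each sit in a distinct availability zone.
    zones = [nodes[n]["zone"] for n in spread]
    if len(set(zones)) != len(zones):
        problems.append(f"nodes {', '.join(spread)} share an AZ: {zones}")

    # The listed nodes must carry the nofailover marker.
    for n in nofailover:
        if not nodes[n].get("nofailover", False):
            problems.append(f"node {n} is not marked as nofailover")

    return problems


# Hypothetical inventory; zone labels are placeholders.
inventory = {
    "101": {"zone": "zone-a"},
    "103": {"zone": "zone-b"},
    "104": {"zone": "zone-b"},
    "106": {"zone": "zone-a", "nofailover": True},
    "107": {"zone": "zone-b", "nofailover": True},
    "108": {"zone": "zone-c"},
}

issues = check_layout(inventory)
for line in issues:
    print(line)
```

In this placeholder inventory the check reports two problems: nodes 103 and 104 share a zone, and node 108 lacks the nofailover marker.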
Backend Engineering - [Optional but nice-to-have]
DRI: @alexives
-
[optional: great to have] Speed up the connection draining after nodes are removed from db-replica.service.consul. Consul/DNS (it's taking up to 40 minutes) - gitlab-org/gitlab#364370 (closed) - This might also be due to the replicas' pgbouncer connection lifetime
-
[optional: good to have] single disable feature flag for database maintenance; - gitlab-org/gitlab#417161 (closed)
Quality - [Optional but nice-to-have]
DRI: @dchevalier2 ?
-
[optional: great to have] Update the smoke test to add a variable to increase max wait time of queries - gitlab-org/quality/quality-engineering/team-tasks#2006 (closed) -
[optional: good to have] Schedule gitlab-qa-sandbox-groups subgroups for cleanup once a month - gitlab-org/quality/quality-engineering/team-tasks#2002 (closed)
Edited by Rafael Henchen