GPRD PG14 Main - Part 1 - Fix Issue Template TODO list after the failed upgrade of 26/08/2013
Fix Issue Template as listed on https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24342
Test issue is #22
Fixes:
- IMPORTANT: disabling background migrations did not stop partitioning. Add a chatops task at T-2 days that disables the partitioning feature flag.
  - We missed adding `/chatops run feature set partition_manager_sync_partitions false` to the issue template, as recommended by Simon from the DB team on July 11th.
- Add checks that the feature flags were disabled before Part 1 and Part 2 (e.g. `/chatops run feature get`).
- The start time for silencing `alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed"` is wrong: it should be T-14 hours, not T zero. Also check whether silencing for 29 hours is enough.
- In Part 1, the pgbouncer connection check should cover only the target cluster's nodes, not all nodes, e.g. `pgbouncer_pools_client_active_connections{env="gprd", fqdn=~"patroni-main-v14.*"}`.
- Add a step to Part 1 that waits for logical replication to get in sync before starting the post-upgrade checks.
- Improve `pg_amcheck` to avoid blocking autovacuum for too long: check whether we can use a `statement_timeout` of 30 minutes (`PGOPTIONS="-c statement_timeout=30min" pg_amcheck ...`).
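
For the silencing item above, the window arithmetic can be sanity-checked with a small sketch. This is only an illustration: the function and constant names are hypothetical, T is assumed to be the planned start of the upgrade in UTC, and the date used is a placeholder. With a T-14 hours start and a 29-hour duration, the silence would end at T+15 hours. Whether that is long enough is still the open question from the list.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names for illustration; none of these come from the issue template.
SILENCE_LEAD = timedelta(hours=14)      # silence should begin at T-14 hours
SILENCE_DURATION = timedelta(hours=29)  # duration still to be confirmed


def silence_window(t_zero, lead=SILENCE_LEAD, duration=SILENCE_DURATION):
    """Return (start, end) of the alert silence for an upgrade starting at t_zero."""
    start = t_zero - lead
    return start, start + duration


def covers(window, moment):
    """True if the given moment falls inside the silence window."""
    start, end = window
    return start <= moment <= end


if __name__ == "__main__":
    # Placeholder T, not the real maintenance date.
    t_zero = datetime(2030, 1, 5, 6, 0, tzinfo=timezone.utc)
    start, end = silence_window(t_zero)
    print("silence starts:", start.isoformat())
    print("silence ends:  ", end.isoformat())
    print("hours covered after T:", (end - t_zero) / timedelta(hours=1))
```

Passing a different `duration` makes it easy to compare candidate silence lengths against the expected end of the backup run.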
Edited by Rafael Henchen