GPRD PG14 Main - Part 1 - Fix Issue Template TODO list after the failed upgrade of 26/08/2023

Rafael Henchen requested to merge 230909-1-fix-issue-template into master

Fix the issue template as listed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24342

Test issue is #22

Fixes:

  • IMPORTANT: disabling background migrations did not stop partitioning. Add a step at T-2 days to disable partitioning with the chatops feature-flag task
    • We omitted /chatops run feature set partition_manager_sync_partitions false from the issue template, although Simon from the DB team recommended it on July 11th
  • Add checks that the feature flags are disabled before Part 1 and Part 2 (e.g. /chatops run feature get)
  • The start time for silencing alertname=~"WALGBaseBackupFailed|walgBaseBackupDelayed" is wrong: it should be T-14 hours, not T zero. Also check whether a 29-hour silence is long enough
  • In Part 1, the pgbouncer connection check should cover only the target cluster's nodes rather than all nodes, e.g. pgbouncer_pools_client_active_connections{env="gprd", fqdn=~"patroni-main-v14.*"}
  • Add a step in Part 1 to wait for logical replication to be in sync before starting the post-upgrade checks
  • Improve the pg_amcheck step to avoid blocking autovacuum for too long; check whether we can use a statement_timeout of 30 minutes (PGOPTIONS="-c statement_timeout=30min" pg_amcheck ...)
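
To make the silence-window fix above easier to review, here is a minimal sketch of the arithmetic, assuming the silence starts at T-14 hours and lasts 29 hours as stated in the list. The function names, the example T, and the "last expected alert" input are hypothetical and are not taken from the issue template; whether 29 hours is actually enough still depends on when the base backups run.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the fix list above; the 29-hour duration is still to be validated.
SILENCE_LEAD = timedelta(hours=14)      # silence should start at T-14 hours
SILENCE_DURATION = timedelta(hours=29)  # proposed length of the silence


def silence_window(t_zero):
    """Return (start, end) of the alert silence for a given upgrade start time T."""
    start = t_zero - SILENCE_LEAD
    end = start + SILENCE_DURATION
    return start, end


def covers(t_zero, last_expected_alert):
    """Check whether the silence still holds at the latest time an alert could fire."""
    start, end = silence_window(t_zero)
    return start <= last_expected_alert <= end


if __name__ == "__main__":
    # Hypothetical T, used only for illustration.
    t_zero = datetime(2030, 1, 5, 6, 0, tzinfo=timezone.utc)
    start, end = silence_window(t_zero)
    print("silence starts:", start.isoformat())
    print("silence ends:  ", end.isoformat())
```

With these values the silence ends at T+15 hours, so the 29-hour duration is sufficient only if no WAL-G base backup alert can fire later than 15 hours after T.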
Edited by Rafael Henchen