gprd-main PG12->PG14 upgrade attempt 2023-08-26 – overview of the issues we had
Here is an overview of the issues we had during the PG12->PG14 upgrade attempt on 2023-08-26 (issue, CR (internal), epic, additional issue with the list of todo items for the upgrade process). This list is more high-level, and a full solution to some of the problems is not trivial and would require additional effort; it is in addition to the detailed list prepared by @rhenchen.gitlab in production#14403 (comment 1530821685).
- One server, `patroni-main-2004-102`, did not have the Consul configuration files required for the upgrade: `db-replica-6.json`, `db-replica-7.json`, `db-replica-8.json`, `db-replica-9.json`.
  - They should have been created by Chef – RCA needed (@rhenchen.gitlab @bshah11 @alexander-sosna)
  - Not all servers are created the same; the Ansible playbook should check for the presence of such files in the pre-check stage, so we wouldn't start the upgrade process if they are missing – we'll add this (cc @vitabaks) – corrective action; a sketch of such a check is below
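A minimal sketch of such a pre-check, assuming a hypothetical `/etc/consul.d` location for the agent configs (the real corrective action would be an Ansible pre-check task, but the logic is the same):

```python
# Pre-check sketch (hypothetical path): fail fast if any Consul service
# definition required for the upgrade is missing on this host.
from pathlib import Path

CONSUL_DIR = Path("/etc/consul.d")  # hypothetical config location
REQUIRED = [f"db-replica-{i}.json" for i in range(6, 10)]  # db-replica-6..9

missing = [name for name in REQUIRED if not (CONSUL_DIR / name).exists()]
if missing:
    raise SystemExit(f"pre-check failed, missing Consul configs: {missing}")
print("all required Consul configs present")
```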
- After the first issue (described above), when the switchover playbook was retried, the Consul re-configuration wasn't applied because the playbook didn't issue a reload.
  - Already fixed and merged in db-migration!484 (merged) – corrective action; a sketch of the reload step is below
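For illustration, the essence of the fix is to explicitly reload the Consul agent after its configuration changes, rather than assuming a retry will pick them up (a sketch; the actual change lives in the db-migration playbook):

```python
# Sketch: after (re)writing Consul service definitions, ask the agent to
# re-read them; `consul reload` is the standard CLI way to trigger this.
import subprocess

subprocess.run(["consul", "reload"], check=True)  # requires the consul CLI on PATH
```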
- `xact_rollback` spikes during the use of logical replication became more annoying and resource-consuming. We know they happen when we drop logical slots, but they also happen at other times, causing investigation efforts that slow down the upgrade/switchover process. At minimum, we need to distinguish these spikes from any user-facing `xact_rollback` spikes, and ideally we should prevent them or find a way to filter them out of monitoring noise.
  - The issue is easy to reproduce during slot deletion (`pg_drop_replication_slot()`), studied in postgres-ai/postgresql-consulting/tests-and-benchmarks#39 – RCA/troubleshooting needed (it looks like a bug, so perhaps it's time to file it in pgsql-bugs and discuss); a repro sketch is below
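A minimal repro sketch, assuming a logical slot created beforehand and a hypothetical DSN: it samples the `xact_rollback` counter in `pg_stat_database` around the slot drop so the spike is visible directly, without the monitoring stack:

```python
# Repro sketch (hypothetical DSN and slot name): watch xact_rollback
# in pg_stat_database around pg_drop_replication_slot().
import time
import psycopg2

DSN = "dbname=gitlabhq_production"  # hypothetical connection string
SLOT = "test_logical_slot"          # hypothetical slot, created beforehand

def rollbacks(cur):
    cur.execute("SELECT xact_rollback FROM pg_stat_database"
                " WHERE datname = current_database()")
    return cur.fetchone()[0]

conn = psycopg2.connect(DSN)
conn.autocommit = True
cur = conn.cursor()

before = rollbacks(cur)
cur.execute("SELECT pg_drop_replication_slot(%s)", (SLOT,))
time.sleep(2)  # stats are reported asynchronously; give the collector a moment
print(f"xact_rollback delta around slot drop: {rollbacks(cur) - before}")
conn.close()
```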
- Some feature flags were accidentally not switched off during the upgrade – background migrations, index recreation. This caused the need to manually propagate DDL when a new partition was created, and caused delays in the switchover process when a pre-check detected a long-running transaction (index recreation).
  - After setting flags, we should read the values back to check them (ideally, we could have such verification in the playbook pre-checks) – corrective action; a verification sketch is below
  - Per @stomlinson, there will soon be a single flag combining all flags needed for the maintenance window, so the risk of such issues will become lower – gitlab-org/gitlab#417161 (closed) – corrective action
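A verification sketch, reading flag states back via GitLab's features API (an admin token is required; the flag names here are hypothetical placeholders, not the actual flags involved):

```python
# Post-set verification sketch: don't trust the "set" call, read the
# persisted flag state back and fail if anything is still on.
import os
import requests

GITLAB_API = "https://gitlab.example.com/api/v4"  # hypothetical instance URL
MUST_BE_OFF = {"background_migrations_flag", "reindexing_flag"}  # hypothetical names

resp = requests.get(f"{GITLAB_API}/features",
                    headers={"PRIVATE-TOKEN": os.environ["GITLAB_ADMIN_TOKEN"]})
resp.raise_for_status()
# Note: /features only lists *persisted* flags; a flag that was never
# toggled won't appear here and falls back to its YAML default.
state = {f["name"]: f["state"] for f in resp.json()}

still_on = [name for name in MUST_BE_OFF if state.get(name) != "off"]
if still_on:
    raise SystemExit(f"pre-check failed, flags not off: {still_on}")
print("all maintenance flags verified off")
```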
- Performance degradation for some RO queries. Some of the degraded queries caused repeatable QA test failures (test 6/10 and 8/10, example, more in the CR). These degradations were not identified by the tests we had before the upgrade, because we only looked at Top-N from pg_stat_statements by `calls` and `total_time`.
  - Detailed analysis of all RO queries timing out and troubleshooting of plans on PG12 & PG14 clones – todo: separate issue with the full list of queryid/fingerprint values of queries that timed out; @fomin.list is on it. Once the list is collected, we'll ping @alexives @stomlinson there: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24339 – corrective action; a snapshot-collection sketch is below
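A snapshot-collection sketch for the "more than Top-N" approach, assuming a hypothetical DSN and output path: dump every `pg_stat_statements` row so PG12 and PG14 timings can later be diffed by `queryid` rather than sampled:

```python
# Snapshot sketch (hypothetical DSN/output): dump all pg_stat_statements
# rows, not just Top-N by calls/total_time, for later cross-version diffing.
import csv
import psycopg2

DSN = "dbname=gitlabhq_production"       # hypothetical
OUT = "pg_stat_statements_snapshot.csv"  # hypothetical

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    # Column names are for PG13+; on PG12 the column is total_time
    # (and mean_time) instead of total_exec_time / mean_exec_time.
    cur.execute("""
        SELECT queryid, calls, total_exec_time, mean_exec_time
        FROM pg_stat_statements
    """)
    with open(OUT, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["queryid", "calls", "total_exec_time", "mean_exec_time"])
        writer.writerows(cur.fetchall())
```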
- Ideas:
  - Improvement idea: pre-upgrade production tests could include a partial RO traffic switchover. In this case, the tests could no longer be considered non-user-facing, but if done carefully, this could be beneficial, giving a high level of confidence for the final upgrade steps. Although this would test only RO traffic, it is relatively easy to implement (no extra tools needed).
  - Another idea: improve the process of plan comparison using various methods (make PG14 more accessible in DBLab; collect more plans, not only Top-N; etc.)
  - Sampling always provides only a limited view of the problem – ideal regression testing would be based on traffic mirroring; it is worth exploring this path, especially tools like pgcat that support it (though infrastructure changes would be needed)
  - A more conservative approach during upgrades: e.g., we could decide that we don't want new optimizations to be enabled right away, keeping them off until fully studied. We did this in the past when CTE materialization (the optimization fence) was removed from the default behavior in PG12, and with the `jit` default change from `off` to `on` in PG12. Here we could also decide to postpone enabling new planner features (`enable_memoize` in this case), even if we didn't find obvious degradation in plans. Although this approach has cons: some current and future plans could obviously benefit from new features, but if we keep them disabled, the chances to benefit are low; this is worth a separate discussion. A sketch of how this could look for `enable_memoize` is below.
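For illustration, keeping a new planner feature off cluster-wide is a one-time settings change (a sketch with a hypothetical DSN; `ALTER SYSTEM` requires superuser):

```python
# Sketch: postpone a new PG14 planner feature (enable_memoize) until studied,
# mirroring what was done for jit and CTE materialization in the PG12 era.
import psycopg2

conn = psycopg2.connect("dbname=postgres")  # hypothetical DSN
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET enable_memoize = off")
    cur.execute("SELECT pg_reload_conf()")  # reloadable GUC, no restart needed
conn.close()
```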
- It took 40 minutes to drain traffic from the PG12 replicas, which burned a lot of the upgrade window (gitlab-org/gitlab#423382 (closed)). This affected not only the switchover to v14 but also delayed the rollback (to more than 1 hour; ideally it should not take more than 30 minutes).
  - We need to investigate why the connections are kept for that long, since Rails refreshes the replica list based on LSN/sync every 2 minutes – corrective action; a drain-monitoring sketch is below
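A drain-monitoring sketch (hypothetical host list and database name): polling each old replica for remaining client backends would show where connections linger, instead of waiting out the full drain blind:

```python
# Drain monitor sketch: count remaining client backends on each old replica
# every 30 seconds so drain progress is visible per host.
import time
import psycopg2

REPLICAS = ["patroni-main-v12-101", "patroni-main-v12-102"]  # hypothetical hosts

for _ in range(80):  # poll for up to ~40 minutes
    for host in REPLICAS:
        conn = psycopg2.connect(host=host, dbname="gitlabhq_production")  # hypothetical
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM pg_stat_activity"
                            " WHERE backend_type = 'client backend'")
                print(f"{host}: {cur.fetchone()[0]} client connections")
        finally:
            conn.close()
    time.sleep(30)
```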