# Patroni replica restart and primary switchover
Production Change - Criticality 2

| Change Objective | Patroni replica restart and primary switchover |
|---|---|
| Change Type | Operation |
| Services Impacted | ~"Service::Patroni" |
| Change Team Members | @hphilipps, @Finotto, OnGres for monitoring |
| Change Criticality | C2 |
| Change Reviewer | @Finotto, @NikolayS, OnGres, Datastores team |
| Tested in staging | Yes |
| Dry-run output | Dry run by using `ansible-playbook --check` |
| Due Date | 2020-06-13 22:00 UTC |
| Time tracking | Less than 1h, depending on how long we wait for connections to drop etc.; the switchover itself will take less than a minute. |
## Summary
To be able to decommission the nodes patroni-09..12, we want to switch over the primary DB from patroni-11 to patroni-01.

As this will necessarily impose a short interruption of DB connections, we want to use this opportunity to also change three Postgres settings that require a Postgres restart (and have already been applied in staging), so that both changes happen within a single short downtime:

- `max_connections`: 300 -> 500
- `autovacuum_max_workers`: 6 -> 10
- `max_wal_size`: 5GB -> 8GB
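For illustration, this is how the three parameter changes could be applied centrally via the Patroni DCS with `patronictl edit-config`. It is only a sketch (the real change is delivered through the Chef MR and the playbook's `run_edit_config=true` run); the cluster name `pg11-ha-cluster` is taken from the rollback steps below:

```bash
# Sketch: set the three PostgreSQL parameters in the DCS cluster config.
# Patroni distributes these to all members; they take effect after restart.
gitlab-patronictl edit-config pg11-ha-cluster --force \
  -p max_connections=500 \
  -p autovacuum_max_workers=10 \
  -p max_wal_size=8GB
```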
We will first do a rolling restart of all replicas with the new settings, taking them out of rotation one by one so as not to interrupt client connections. This will take around 20 minutes and cause no downtime.

Then we will execute a primary switchover using Patroni. This will cause a short interruption of DB connections (probably below 10s). We will execute the steps during low-traffic times, mostly automated through Ansible, to keep the downtime and risk as low as possible.
The procedure has been tested in staging multiple times.
## Detailed steps for the change
### Pre-Conditions
- patroni-11 is primary and patroni-01..08 are up and in sync
- Chef MR with config changes is approved
- Ansible environment prepared on console node (a quick sanity check is sketched below)
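A quick way to sanity-check the prepared Ansible environment from the console node is an ad-hoc ping against the inventory used in the execution steps (a sketch, not part of the tested procedure):

```bash
# Verify all hosts in the inventory are reachable through Ansible.
ansible -i inventory/gprd.yml all -m ping -o
```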
### Execution
- Inform EOC
- Set silence for patroni nodes: gprd and ops
- Merge MR
- Execute chef-client on patroni nodes: `knife ssh 'role:gprd-base-db-patroni' sudo chef-client`
- Open tmux session in ansible workdir on console node (setup instructions)
- Check inventory file
- Execute restart-replicas playbook: `ansible-playbook -i inventory/gprd.yml -e "run_edit_config=true" playbooks/restart_postgres.yml`
  - This will execute the following steps for each replica, one by one:
    - Apply central `max_connections=500` setting via DCS
    - Stop chef-client service
    - Take node out of loadbalancing and failover
    - Wait for connections to drain (timeout 4m, allowed to fail)
    - Terminate still-existing connections (see the sketch after this list)
    - Execute checkpoint
    - Restart Postgres
    - Wait for Postgres to become ready (timeout 5m)
    - Check for replication lag to be below 2s
    - Enable loadbalancing and failover for this node again
    - Check for traffic ramping up
    - Start chef-client service
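The "terminate still-existing connections" and "execute checkpoint" steps boil down to something like the following on the replica being restarted; a minimal sketch, not necessarily the playbook's exact queries:

```bash
# Kill remaining client backends (everything except our own session), then
# checkpoint so the shutdown checkpoint during the restart is fast.
gitlab-psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE backend_type = 'client backend' AND pid <> pg_backend_pid();"
gitlab-psql -c "CHECKPOINT;"
```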
- Check status and replication lag for all replicas: `gitlab-patronictl list`
- Check for latencies and error rates in the Platform Triage Dashboard
- Check if settings have been applied: `gitlab-psql -c 'SELECT name,setting FROM pg_settings;'`
- Execute switchover playbook: `ansible-playbook -i inventory/gprd.yml -e "candidate=patroni-01-db-gprd.c.gitlab-production.internal" playbooks/switchover_primary.yml`
  - This will execute the following:
    - Execute checkpoint on primary and replicas
    - Wait for replication lag on switchover candidate to be below 2s
    - Set very low `checkpoint_completion_target` on candidate, to prevent replicas having to wait for a spread checkpoint after switchover (see the sketch after this list)
    - Switch over the primary
    - Wait for new primary to become ready
    - Check if old primary is ready as replica
    - Reset `checkpoint_completion_target` on candidate to old value
    - Wait for replication lag on old primary to be below 2s
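The `checkpoint_completion_target` adjustment around the switchover could look like this on the candidate; a sketch assuming it is set at the Postgres level (the playbook may instead go through Patroni/DCS):

```bash
# Before the switchover: make checkpoints complete as fast as possible.
gitlab-psql -c "ALTER SYSTEM SET checkpoint_completion_target = 0;"
gitlab-psql -c "SELECT pg_reload_conf();"

# After the switchover: drop the override to fall back to the configured value.
gitlab-psql -c "ALTER SYSTEM RESET checkpoint_completion_target;"
gitlab-psql -c "SELECT pg_reload_conf();"
```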
- Check status and replication lag for all replicas: `gitlab-patronictl list`
- Check for latencies and error rates in the Platform Triage Dashboard
- Check replicas in consul: `dig @localhost -p 8600 +short db-replica.service.consul. SRV | awk '{ print $4 }'`
- Take patroni-11 out of rotation for decommissioning later:
  - `systemctl stop chef-client`
  - Take node out of loadbalancing and failover, reload patroni
  - Wait for all connections to drain from patroni-11 (can take a few minutes; a polling sketch follows this list): `for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l`
  - Check cluster status
  - `systemctl stop patroni`
  - Check if patroni-11 is gone from the cluster: `gitlab-patronictl list`
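Instead of re-running the counting command by hand, the drain wait can be wrapped in a small polling loop (same counting pipeline as in the step above):

```bash
# Poll the pgbouncer consoles until no gitlabhq_production clients remain.
while true; do
  n=$(for c in /usr/local/bin/pgb-console*; do
        $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor
      done | wc -l)
  echo "$(date -u +%H:%M:%S) remaining client connections: $n"
  [ "$n" -eq 0 ] && break
  sleep 10
done
```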
### Post Execution
- Check replication lag of delayed and archive replica (a lag query sketch follows this list)
- Make sure `VACUUM ANALYZE` was executed on the new primary (automatically done on role change by patroni)
- Make sure the postgres-checkup project is including the new primary: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10343
- Remove silences, but leave a silence for patroni-09..12
- Decommission patroni-09..12
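One way to check the apply lag on the delayed and archive replicas is to compare the last replayed transaction timestamp to the current time; a sketch to run on the replica itself (the delayed replica is expected to lag by design):

```bash
# Shows how far behind WAL replay is on this replica.
gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"
```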
## Rollback steps
- For rolling back the config changes:
  - Revert the MR
  - Run chef-client
  - Restart Postgres on the primary (it needs to have `max_connections` lowered before the replicas): `gitlab-patronictl restart --force -r master pg11-ha-cluster`
  - Adjust the Ansible inventory to the current cluster topology (if we switched the primary)
  - Execute the restart-postgres playbook again as above
- For rolling back to the previous primary:
  - Adjust the Ansible inventory to the current cluster topology, then run: `ansible-playbook -i inventory/gprd.yml -e "candidate=patroni-11-db-gprd.c.gitlab-production.internal" playbooks/switchover_primary.yml`
  - Or, in an emergency without Ansible: `gitlab-patronictl switchover --master patroni-01-db-gprd.c.gitlab-production.internal --candidate patroni-11-db-gprd.c.gitlab-production.internal`
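Whichever rollback path is taken, it is worth confirming the resulting topology and that the reverted settings are live; a small verification sketch reusing commands referenced above:

```bash
# Cluster topology and replica health after the rollback.
gitlab-patronictl list

# Confirm the three settings are back at their previous values.
gitlab-psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('max_connections', 'autovacuum_max_workers', 'max_wal_size');"
```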
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no open issues labeled as ~"Service::Monitoring" with severities of ~S1 or ~S2