Upgrade postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd to Postgres 11.7
Production Change - Criticality 3

| Change Objective | Upgrade Postgres on the archive and delayed replicas and re-sync replication |
| --- | --- |
| Change Type | ConfigurationChange |
| Services Impacted | ServicePostgres |
| Change Team Members | @craig @NikolayS @emanuel_ongres |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @alejandro |
| Dry-run output | If the change is done through a script, it is mandatory to have dry-run capability in the script, run the change in dry-run mode, and output the result |
| Due Date | 2020-05-12 22:30 UTC (15:30 PDT) |
| Time tracking | Estimate and record the times associated with the change, including a possible rollback |
## Detailed steps for the change
### Part 1: Create GCP snapshot
- Pick a node in the same zone as the archive/delayed replicas, in order to speed up the restore process (prefer a node in the same AZ as the existing "archive" instance: postgres-dr-archive-01-db-gprd is in us-east1-c, and patroni-07-db-gprd, patroni-04-db-gprd, and patroni-01-db-gprd are also in us-east1-c).
- Initiate a tmux session on patroni-11 (the leader) as the gitlab-psql user:
  - Create the session with `tmux new -s 2115` (attach with `tmux a -t 2115`).
- Disable chef on patroni-07-db-gprd: `systemctl stop chef-client.service`
- In Patroni, mark the chosen production replica as not available for promotion or for read-only queries by adding `nofailover: true` and `noloadbalance: true` under `tags:` in `patroni.yml`.
- Reload Patroni on the elected replica: `gitlab-patronictl reload pg11-ha-cluster patroni-07-db-gprd.c.gitlab-production.internal`
- Execute `select pg_start_backup('gcs_snapshot_20200512', false, false)` on the gprd master AND KEEP THE CONNECTION OPEN; remember the returned LSN.
- Wait until that LSN has propagated to the chosen replica (see the sketch after this list).
- Change the target of the command to the chosen replica (the current leader is patroni-11): `mussh -m -b -i $HOME/.ssh/id_rsa_gitlab_ecalvo -h patroni-{07,11}-db-gprd.c.gitlab-production.internal -c "sudo -u gitlab-psql /opt/td-agent/embedded/bin/pg_controldata /var/opt/gitlab/postgresql/data11 | grep 'Latest checkpoint'"`
- In the GCP console, create a snapshot of the sdb disk on patroni-07-db-gprd.
- Execute `select pg_stop_backup(false, true)` on the master IN THE SAME CONNECTION where `pg_start_backup` was executed.
- Remove the added tags and reload Patroni.
- Re-enable chef: `systemctl start chef-client.service`
- Kill the tmux session (`exit`).
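A minimal sketch of the LSN propagation check referenced above, run on the chosen replica. The psql path and the use of `pg_last_wal_replay_lsn()`/`pg_wal_lsn_diff()` (available on Postgres 10+) are assumptions; the target LSN is whatever `pg_start_backup` returned:

```bash
#!/usr/bin/env bash
# Sketch (assumptions noted above): wait until this replica has replayed
# past the LSN returned by pg_start_backup() on the master.
set -euo pipefail

TARGET_LSN="$1"                       # e.g. the LSN printed by pg_start_backup
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed psql location on these hosts

while true; do
  replay_lsn=$(sudo -u gitlab-psql "$PSQL" -At -c "SELECT pg_last_wal_replay_lsn()")
  # pg_wal_lsn_diff() returns the byte difference between two LSNs;
  # a non-negative result means the replica has replayed past the target.
  caught_up=$(sudo -u gitlab-psql "$PSQL" -At \
    -c "SELECT pg_wal_lsn_diff('${replay_lsn}', '${TARGET_LSN}') >= 0")
  echo "replayed: ${replay_lsn} (target: ${TARGET_LSN})"
  [ "${caught_up}" = "t" ] && break
  sleep 5
done
```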
### Prep for snapshot mounts on replicas
- Stop and disable chef-client on postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd.
- Snapshot postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd for rollback to the pre-upgrade version/state (see the sketch after this list).
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3409
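The rollback snapshots can also be taken from the CLI rather than the console; a minimal sketch with `gcloud`, where the disk names (following the `-data` naming used elsewhere in this plan), zone, and snapshot names are assumptions:

```bash
# Sketch (assumed disk names, zone, and snapshot names): take pre-upgrade
# rollback snapshots of both DR replicas' data disks.
for host in postgres-dr-archive-01-db-gprd postgres-dr-delayed-01-db-gprd; do
  gcloud compute disks snapshot "${host}-data" \
    --zone=us-east1-c \
    --snapshot-names="${host}-pre-upgrade-$(date +%Y%m%d)"
done
```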
### Part 2: Use GCP snapshot to create sdb on the “archive” replica
- Delete/unmount the current sdb on postgres-dr-archive-01-db-gprd.
- Restore the snapshot from patroni-07-db-gprd to the sdb disk on postgres-dr-archive-01-db-gprd (GCE disk name: `postgres-dr-archive-01-db-gprd-data`).
- Place the corresponding `recovery.conf` on the archive node, containing `standby_mode = 'on'`, `restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'`, and `recovery_target_timeline = 'latest'`.
- Remove and re-install gitlab-ee at the new version (remove 11.11.8, install 12.10.3).
- Restore/re-sync the WAL archives for the replica to replay.
- Re-enable and run chef-client to upgrade gitlab-omnibus and postgresql on postgres-dr-archive-01-db-gprd:
  - `chef-client-enable`
- Start and validate Postgres with the snapshot data (see the sketch after this list):
  - `/opt/gitlab/embedded/bin/gitlab-pg-ctl start`
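Once Postgres is started from the snapshot data, recovery can be sanity-checked from SQL; a minimal sketch, where the psql path is the same assumption as in the earlier sketch:

```bash
# Sketch: confirm the archive replica is in recovery and replaying WAL.
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed path

sudo -u gitlab-psql "$PSQL" -c "SELECT pg_is_in_recovery()"       # expect: t
sudo -u gitlab-psql "$PSQL" -c "SELECT pg_last_wal_replay_lsn()"  # should advance between runs
sudo -u gitlab-psql "$PSQL" -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"
```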
### Part 3: Use GCP snapshot to create sdb on the "delayed" replica
- Delete/unmount the current sdb on postgres-dr-delayed-01-db-gprd.
- Restore the snapshot from patroni-07-db-gprd to the sdb disk on postgres-dr-delayed-01-db-gprd (GCE disk name: `postgres-dr-delayed-01-db-gprd-data`).
- Place the corresponding `recovery.conf` on the delayed-replica node, containing `standby_mode = 'on'`, `recovery_min_apply_delay = '8h'`, `restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'`, and `recovery_target_timeline = 'latest'`.
- Remove and re-install gitlab-ee at the new version (remove 11.11.8, install 12.10.3).
- Restore/re-sync the WAL archives for the replica to replay.
- Re-enable and run chef-client to upgrade gitlab-omnibus and postgresql on postgres-dr-delayed-01-db-gprd:
  - `chef-client-enable`
- Start and validate Postgres with the snapshot data (see the sketch after this list):
  - `/opt/gitlab/embedded/bin/gitlab-pg-ctl start`
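The same validation applies to the delayed replica, except the replay lag should settle around the configured `recovery_min_apply_delay` of 8h once the replica has caught up; a minimal sketch under the same path assumptions:

```bash
# Sketch: on the delayed replica, replay lag should trend toward ~8h
# (the recovery_min_apply_delay configured in recovery.conf above).
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed path

sudo -u gitlab-psql "$PSQL" -At -c \
  "SELECT pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp() AS replay_lag"
```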
## Rollback steps
Restore the GCE VM instances from the snapshots taken in step 2.2 above ("Snapshot postgres-dr-archive-01-db-gprd & postgres-dr-delayed-01-db-gprd for rollback to pre-upgrade version/state"), then re-evaluate and restart the process.

Also, as a Plan B, we can consider using an approach similar to gitlab-restore to seed the necessary data instead of snapshots.
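A minimal sketch of the rollback disk restore with `gcloud`, using the hypothetical snapshot names from the sketch in the prep section; Postgres would be stopped and the current disk detached/deleted before recreating it:

```bash
# Sketch (assumed disk/snapshot names and zone): recreate a DR replica's
# data disk from its pre-upgrade snapshot and re-attach it.
host=postgres-dr-archive-01-db-gprd   # repeat for postgres-dr-delayed-01-db-gprd
gcloud compute disks create "${host}-data" \
  --zone=us-east1-c \
  --source-snapshot="${host}-pre-upgrade-20200512"
gcloud compute instances attach-disk "${host}" \
  --disk="${host}-data" \
  --zone=us-east1-c
```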
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no open issues labeled ServiceMonitoring with severity ~S1 or ~S2