Upgrade postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd to Postgres 11.7
Production Change - Criticality 3

| Change Objective | Upgrade Postgres on the archive and delayed replicas and re-sync replication |
| --- | --- |
| Change Type | ConfigurationChange |
| Services Impacted | ServicePostgres |
| Change Team Members | @craig @NikolayS @emanuel_ongres |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | @alejandro |
| Dry-run output | If the change is done through a script, it is mandatory to have dry-run capability in the script, run the change in dry-run mode, and output the result |
| Due Date | 2020-05-12 22:30 UTC (15:30 PDT) |
| Time tracking | Estimate and record the times associated with the change, including a possible rollback |
## Detailed steps for the change
### Part 1: Create GCP snapshot
- Pick a node in the same zone as the archive/delayed replicas, in order to speed up the restore process (prefer a node in the same AZ as the existing "archive" instance: postgres-dr-archive-01-db-gprd is in us-east1-c, and patroni-07-db-gprd, patroni-04-db-gprd, and patroni-01-db-gprd are also in us-east1-c).
- Initiate a tmux session on patroni-11 (the leader) as the gitlab-psql user:
  - Create the session with `tmux new -s 2115` (attach with `tmux a -t 2115`).
- Disable chef on patroni-07-db-gprd: `systemctl stop chef-client.service`
- In Patroni, mark the chosen production replica as not available for promotion or for read-only queries by adding `nofailover: true` and `noloadbalance: true` under `tags:` in `patroni.yml`.
- Reload Patroni on the elected replica: `gitlab-patronictl reload pg11-ha-cluster patroni-07-db-gprd.c.gitlab-production.internal`
- Execute `select pg_start_backup('gcs_snapshot_20200512', false, false)` on the gprd master AND KEEP THE CONNECTION OPEN; remember the returned LSN.
- Wait until that LSN has propagated to the chosen replica (see the sketch after this list).
- Change the target of the command to the chosen replica (the current leader is patroni-11): `mussh -m -b -i $HOME/.ssh/id_rsa_gitlab_ecalvo -h patroni-{07,11}-db-gprd.c.gitlab-production.internal -c "sudo -u gitlab-psql /opt/td-agent/embedded/bin/pg_controldata /var/opt/gitlab/postgresql/data11 | grep 'Latest checkpoint'"`
- In the GCP console, create a snapshot of the sdb disk on patroni-07-db-gprd.
- Execute `select pg_stop_backup(false, true)` on the master IN THE SAME CONNECTION where `pg_start_backup` was executed.
- Remove the added tags and reload Patroni.
- Re-enable chef: `systemctl start chef-client.service`
- Kill the tmux session (`exit`).
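A minimal sketch of the LSN propagation check referenced above, run on the chosen replica. The psql path and the use of `pg_last_wal_replay_lsn()`/`pg_wal_lsn_diff()` (available on Postgres 10+) are assumptions; the target LSN is whatever `pg_start_backup` returned:

```bash
#!/usr/bin/env bash
# Sketch (assumptions noted above): wait until this replica has replayed
# past the LSN returned by pg_start_backup() on the master.
set -euo pipefail

TARGET_LSN="$1"                       # e.g. the LSN printed by pg_start_backup
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed psql location on these hosts

while true; do
  replay_lsn=$(sudo -u gitlab-psql "$PSQL" -At -c "SELECT pg_last_wal_replay_lsn()")
  # pg_wal_lsn_diff() returns the byte difference between two LSNs;
  # a non-negative result means the replica has replayed past the target.
  caught_up=$(sudo -u gitlab-psql "$PSQL" -At \
    -c "SELECT pg_wal_lsn_diff('${replay_lsn}', '${TARGET_LSN}') >= 0")
  echo "replayed: ${replay_lsn} (target: ${TARGET_LSN})"
  [ "${caught_up}" = "t" ] && break
  sleep 5
done
```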
### Prep for snapshot mounts on replicas
- Stop and disable chef-client on postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd.
- Snapshot postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd for rollback to the pre-upgrade version/state (see the sketch after this list).
- Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3409
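The rollback snapshots can also be taken from the CLI rather than the console; a minimal sketch with `gcloud`, where the disk names (following the `-data` naming used elsewhere in this plan), zone, and snapshot names are assumptions:

```bash
# Sketch (assumed disk names, zone, and snapshot names): take pre-upgrade
# rollback snapshots of both DR replicas' data disks.
for host in postgres-dr-archive-01-db-gprd postgres-dr-delayed-01-db-gprd; do
  gcloud compute disks snapshot "${host}-data" \
    --zone=us-east1-c \
    --snapshot-names="${host}-pre-upgrade-$(date +%Y%m%d)"
done
```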
### Part 2: Use GCP snapshot to create sdb on the “archive” replica
- Delete/unmount the current sdb on postgres-dr-archive-01-db-gprd.
- Restore the snapshot from patroni-07-db-gprd to the sdb disk on postgres-dr-archive-01-db-gprd (GCE disk name: `postgres-dr-archive-01-db-gprd-data`).
- Place the corresponding `recovery.conf` on the archive node, containing `standby_mode = 'on'`, `restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'`, and `recovery_target_timeline = 'latest'`.
- Remove and re-install gitlab-ee at the new version (remove 11.11.8, install 12.10.3).
- Restore/re-sync the WAL archives for the replica to replay.
- Re-enable and run chef-client to upgrade gitlab-omnibus and postgresql on postgres-dr-archive-01-db-gprd:
  - `chef-client-enable`
- Start and validate Postgres with the snapshot data (see the sketch after this list):
  - `/opt/gitlab/embedded/bin/gitlab-pg-ctl start`
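Once Postgres is started from the snapshot data, recovery can be sanity-checked from SQL; a minimal sketch, where the psql path is the same assumption as in the earlier sketch:

```bash
# Sketch: confirm the archive replica is in recovery and replaying WAL.
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed path

sudo -u gitlab-psql "$PSQL" -c "SELECT pg_is_in_recovery()"       # expect: t
sudo -u gitlab-psql "$PSQL" -c "SELECT pg_last_wal_replay_lsn()"  # should advance between runs
sudo -u gitlab-psql "$PSQL" -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"
```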
### Part 3: Use GCP snapshot to create sdb on the "delayed" replica
- Delete/unmount the current sdb on postgres-dr-delayed-01-db-gprd.
- Restore the snapshot from patroni-07-db-gprd to the sdb disk on postgres-dr-delayed-01-db-gprd (GCE disk name: `postgres-dr-delayed-01-db-gprd-data`).
- Place the corresponding `recovery.conf` on the delayed-replica node, containing `standby_mode = 'on'`, `recovery_min_apply_delay = '8h'`, `restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'`, and `recovery_target_timeline = 'latest'`.
- Remove and re-install gitlab-ee at the new version (remove 11.11.8, install 12.10.3).
- Restore/re-sync the WAL archives for the replica to replay.
- Re-enable and run chef-client to upgrade gitlab-omnibus and postgresql on postgres-dr-delayed-01-db-gprd:
  - `chef-client-enable`
- Start and validate Postgres with the snapshot data (see the sketch after this list):
  - `/opt/gitlab/embedded/bin/gitlab-pg-ctl start`
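The same validation applies to the delayed replica, except the replay lag should settle around the configured `recovery_min_apply_delay` of 8h once the replica has caught up; a minimal sketch under the same path assumptions:

```bash
# Sketch: on the delayed replica, replay lag should trend toward ~8h
# (the recovery_min_apply_delay configured in recovery.conf above).
PSQL="/opt/gitlab/embedded/bin/psql"  # assumed path

sudo -u gitlab-psql "$PSQL" -At -c \
  "SELECT pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp() AS replay_lag"
```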
## Rollback steps
Restore the GCE VM instances from the snapshots taken in step 2.2 above ("Snapshot postgres-dr-archive-01-db-gprd & postgres-dr-delayed-01-db-gprd for rollback to pre-upgrade version/state"), then re-evaluate and restart the process.

Also, as a Plan B, we can consider using an approach similar to gitlab-restore to seed the necessary data instead of snapshots.
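A minimal sketch of the rollback disk restore with `gcloud`, using the hypothetical snapshot names from the sketch in the prep section; Postgres would be stopped and the current disk detached/deleted before recreating it:

```bash
# Sketch (assumed disk/snapshot names and zone): recreate a DR replica's
# data disk from its pre-upgrade snapshot and re-attach it.
host=postgres-dr-archive-01-db-gprd   # repeat for postgres-dr-delayed-01-db-gprd
gcloud compute disks create "${host}-data" \
  --zone=us-east1-c \
  --source-snapshot="${host}-pre-upgrade-20200512"
gcloud compute instances attach-disk "${host}" \
  --disk="${host}-data" \
  --zone=us-east1-c
```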
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no open issues labeled ServiceMonitoring with severity ~S1 or ~S2