[GSTG][PG16 Upgrade] - Test multi-slot PostgreSQL logical replication upgrade playbook before Production
Production Change
Change Summary
Validate the multi-slot logical replication upgrade playbook in GSTG before a new attempt to run multi-slot replication in Production.
A couple of weeks ago we tried to perform a multi-slot test in the GPRD environment (#18049, closed), but the CR was aborted due to playbook- and infrastructure-related issues.
The fixes were applied to the upgrade playbook (https://gitlab.com/gitlab-com/gl-infra/dbre/-/issues/61), but we would like to validate them in the GSTG environment first.
The new GPRD CR is: #18153 (unscheduled)
Regarding DDL lockout
We are not going to block DDLs for this CR.
We understand there is a risk of DDLs being executed, but since the goal is only to validate the upgrade playbook execution, this is acceptable: if any DDL blocks replication, we can manually execute the same DDL in the v16 cluster.
Change Details
- Services Impacted - ~"Service::Patroni"
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @bshah11
- Time tracking - 5 hours
- Downtime Component - no downtime
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 240 minutes
- Set label ~change::in-progress: `/label ~change::in-progress`
- Create the following silences at https://alerts.gitlab.net (duration: 1 day):
  - Rule: `env="gstg"`, `fqdn=~"patroni-main-v16.*"`
  - Rule: `env="gstg"`, `type="gstg-patroni-main-v16"`
  - Rule: `env="gstg"`, `type="patroni"`, `alertname=~"PostgresSplitBrain"`
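  If creating the silences from a terminal is preferred, something like the sketch below should work with `amtool`. This is a hedged sketch, not part of the CR template: it assumes `amtool` is installed and that the Alertmanager API behind https://alerts.gitlab.net is reachable at that URL; the comment text is illustrative.

  ```
  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg 'fqdn=~"patroni-main-v16.*"'

  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg type=gstg-patroni-main-v16

  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg type=patroni 'alertname=~"PostgresSplitBrain"'
  ```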
- Apply TF on MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8554
- Check that the consul service endpoint `db-replica-v16.service.consul.` points to the v16 nodes, and that `db-replica.service.consul.` still points to the v14 replica nodes:

  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 db-replica-v16.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni-v16.service.consul. SRV +short
  ```
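  As a convenience, the same four checks can be run in one pass. A minimal sketch, assuming the local consul agent exposes its DNS interface on port 8600 exactly as in the dig commands above:

  ```
  # Print SRV records for each consul service endpoint involved in the cutover.
  for svc in db-replica db-replica-v16 master.patroni master.patroni-v16; do
    echo "== ${svc}.service.consul. =="
    dig @127.0.0.1 -p 8600 "${svc}.service.consul." SRV +short
  done
  ```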
- Follow the runbook steps to sync the standby cluster: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/build_cluster_from_snapshot.md#52-stop-patroni-and-reset-wal-directory-from-old-files
- Wait for the v16 cluster to get in sync
- Run the upgrade playbook pre-check:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" \
    --tags "pre-checks, packages, upgrade-check"
  ```
- Check all connections that are not from the gitlab user:

  ```
  gitlab-psql -c "
  select pid, client_addr, usename, application_name, backend_type,
         clock_timestamp() - backend_start as connected_ago, state,
         left(query, 200) as query
  from pg_stat_activity
  where pid <> pg_backend_pid()
    and usename <> 'gitlab'
    and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
    and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
    and application_name <> 'Patroni'
  "
  ```
- Wait for any DDL or large transaction to finish; this can take from several minutes up to hours (see the query sketch below)
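  One way to spot in-flight DDL or long-running transactions is the sketch below. It is not part of the playbook; run it on the v14 writer:

  ```
  # List the oldest open transactions first, with a truncated query preview.
  gitlab-psql -c "
  select pid, usename, state,
         clock_timestamp() - xact_start as xact_age,
         left(query, 120) as query
  from pg_stat_activity
  where xact_start is not null
    and pid <> pg_backend_pid()
  order by xact_start
  limit 20
  "
  ```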
- Run the Ansible playbook to upgrade the GSTG Main v16 cluster:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_main_$(date +%Y%m%d).log
  ```
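  Once the playbook has started logical replication, the initial table sync can be followed on the v16 side. A minimal sketch, assuming it is run on the v16 writer: `pg_subscription_rel` tracks per-table sync state, where `i` = initialize, `d` = data copy, `s` = synchronized, and `r` = ready.

  ```
  # Count tables per sync state; all rows should eventually reach 'r'.
  gitlab-psql -c "
  select srsubstate, count(*)
  from pg_subscription_rel
  group by srsubstate
  order by srsubstate
  "
  ```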
- Check logical replication lag and wait for the clusters to get in sync:
  - PG Upgrade Dashboard
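  To complement the dashboard, lag can also be read directly from the v14 source. A minimal sketch showing the WAL bytes not yet confirmed by each logical slot:

  ```
  # Per-slot logical replication lag in bytes, pretty-printed.
  gitlab-psql -c "
  select slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
  where slot_type = 'logical'
  "
  ```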
- Check the v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
- Run the Ansible playbook to stop the logical replication:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    stop_upgrade_replication.yml -e "pg_publication_count=4"
  ```
- If the playbook doesn't work, clean up manually (a helper sketch follows below):
  - Drop the subscriptions on the TARGET (v16). Query:

    ```
    select * from pg_stat_subscription;
    ```

    For each subscription do:

    ```
    ALTER SUBSCRIPTION logical_subscription_01 DISABLE;
    ALTER SUBSCRIPTION logical_subscription_01 SET (slot_name=NONE);
    DROP SUBSCRIPTION logical_subscription_01;
    ```

    Re-run the query above to confirm no subscriptions remain.
  - Drop the publications on the SOURCE (v14). Queries:

    ```
    select * from pg_publication;
    select slot_name, slot_type from pg_replication_slots where slot_type = 'logical';
    ```

    For each publication and its slot do:

    ```
    DROP PUBLICATION IF EXISTS logical_replication_01;
    SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_01';
    ```

    Re-run the queries above to confirm no publications or logical replication slots remain.
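  Since the CR runs with `pg_publication_count=4`, the per-object statements above need to be repeated four times. A hypothetical helper (not part of the playbook, and assuming the zero-padded naming pattern shown above) can print the statements for review before running them; replication slots are best dropped via the `pg_drop_replication_slot` query above rather than by guessing slot names:

  ```
  # Emit cleanup SQL for the four subscriptions/publications; review, then
  # run each block on the cluster named in the comment line.
  for i in 01 02 03 04; do
    echo "-- target (v16):"
    echo "ALTER SUBSCRIPTION logical_subscription_${i} DISABLE;"
    echo "ALTER SUBSCRIPTION logical_subscription_${i} SET (slot_name=NONE);"
    echo "DROP SUBSCRIPTION logical_subscription_${i};"
    echo "-- source (v14):"
    echo "DROP PUBLICATION IF EXISTS logical_replication_${i};"
  done
  ```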
- Check the v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
- Merge https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8721 to destroy the v16 cluster
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Stop/cancel any running playbook executions
- Revert the MR and apply TF to destroy the patroni-main-v16 cluster: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8554
- Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: transactions_primary SLI Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: rails_primary_sql SLI Apdex
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: APDEX < 99.4% soft SLO for more than 10 minutes.
- Metric: Patroni Service Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: patroni Service Apdex
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: APDEX < 99.4% soft SLO for more than 10 minutes.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.