[GSTG][PG16 Upgrade] - Test multi-slot PostgreSQL logical replication upgrade playbook before Production
Production Change
Change Summary
Validate the multi-slot logical replication upgrade playbook in GSTG before a new attempt to run multi-slot replication in Production.
A couple of weeks ago we tried to perform a multi-slot test in the GPRD environment (#18049, closed), but the CR was aborted due to playbook- and infrastructure-related issues.
The fixes were applied to the upgrade playbook (https://gitlab.com/gitlab-com/gl-infra/dbre/-/issues/61), but we would like to validate them in the GSTG environment first.
The new GPRD CR is: #18153 (unscheduled)
Regarding DDL lockout
We are not going to block DDLs for this CR.
We understand there is a risk of DDLs being executed, but since the goal is only to validate the upgrade playbook execution, this is acceptable: if any DDL blocks replication, we can manually execute the same DDL in the v16 cluster.
Change Details
- Services Impacted - ~"Service::Patroni"
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @bshah11
- Time tracking - 5 hours
- Downtime Component - no downtime
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 240 minutes
- Set label ~change::in-progress: `/label ~change::in-progress`
- Create the following silences at https://alerts.gitlab.net (duration: 1 day):
  - Rule: `env="gstg"`, `fqdn=~"patroni-main-v16.*"`
  - Rule: `env="gstg"`, `type="gstg-patroni-main-v16"`
  - Rule: `env="gstg"`, `type="patroni"`, `alertname=~"PostgresSplitBrain"`
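  If creating the silences from a terminal is preferred, something like the sketch below should work with `amtool`. This is a hedged sketch, not part of the CR template: it assumes `amtool` is installed and that the Alertmanager API behind https://alerts.gitlab.net is reachable at that URL; the comment text is illustrative.

  ```
  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg 'fqdn=~"patroni-main-v16.*"'

  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg type=gstg-patroni-main-v16

  amtool silence add --alertmanager.url=https://alerts.gitlab.net \
    --duration=1d --comment='GSTG PG16 multi-slot upgrade test' \
    env=gstg type=patroni 'alertname=~"PostgresSplitBrain"'
  ```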
- Apply TF on MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8554
- Check that the consul service endpoint `db-replica-v16.service.consul.` points to the v16 nodes, and that `db-replica.service.consul.` still points to the v14 replica nodes:

  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 db-replica-v16.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni-v16.service.consul. SRV +short
  ```
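  As a convenience, the same four checks can be run in one pass. A minimal sketch, assuming the local consul agent exposes its DNS interface on port 8600 exactly as in the dig commands above:

  ```
  # Print SRV records for each consul service endpoint involved in the cutover.
  for svc in db-replica db-replica-v16 master.patroni master.patroni-v16; do
    echo "== ${svc}.service.consul. =="
    dig @127.0.0.1 -p 8600 "${svc}.service.consul." SRV +short
  done
  ```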
- Follow the runbook steps to sync the standby cluster: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/build_cluster_from_snapshot.md#52-stop-patroni-and-reset-wal-directory-from-old-files
- Wait for the v16 cluster to get in sync
- Run the upgrade playbook pre-check:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" \
    --tags "pre-checks, packages, upgrade-check"
  ```
- Check all connections that are not from the gitlab user:

  ```
  gitlab-psql -c "
  select pid, client_addr, usename, application_name, backend_type,
         clock_timestamp() - backend_start as connected_ago, state,
         left(query, 200) as query
  from pg_stat_activity
  where pid <> pg_backend_pid()
    and usename <> 'gitlab'
    and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
    and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
    and application_name <> 'Patroni'
  "
  ```
- Wait for any DDL or large transaction to finish; this can take from several minutes up to hours (see the query sketch below)
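  One way to spot in-flight DDL or long-running transactions is the sketch below. It is not part of the playbook; run it on the v14 writer:

  ```
  # List the oldest open transactions first, with a truncated query preview.
  gitlab-psql -c "
  select pid, usename, state,
         clock_timestamp() - xact_start as xact_age,
         left(query, 120) as query
  from pg_stat_activity
  where xact_start is not null
    and pid <> pg_backend_pid()
  order by xact_start
  limit 20
  "
  ```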
- Run the Ansible playbook to upgrade the GSTG Main v16 cluster:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a ansible_upgrade_gstg_main_$(date +%Y%m%d).log
  ```
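  Once the playbook has started logical replication, the initial table sync can be followed on the v16 side. A minimal sketch, assuming it is run on the v16 writer: `pg_subscription_rel` tracks per-table sync state, where `i` = initialize, `d` = data copy, `s` = synchronized, and `r` = ready.

  ```
  # Count tables per sync state; all rows should eventually reach 'r'.
  gitlab-psql -c "
  select srsubstate, count(*)
  from pg_subscription_rel
  group by srsubstate
  order by srsubstate
  "
  ```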
- Check logical replication lag and wait for the clusters to get in sync:
  - PG Upgrade Dashboard
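  To complement the dashboard, lag can also be read directly from the v14 source. A minimal sketch showing the WAL bytes not yet confirmed by each logical slot:

  ```
  # Per-slot logical replication lag in bytes, pretty-printed.
  gitlab-psql -c "
  select slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
  where slot_type = 'logical'
  "
  ```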
- Check the v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
- Run the Ansible playbook to stop the logical replication:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gstg-main.yml \
    stop_upgrade_replication.yml -e "pg_publication_count=4"
  ```
- If the playbook doesn't work, clean up manually (a helper sketch follows below):
  - Drop the subscriptions on the TARGET (v16). Query:

    ```
    select * from pg_stat_subscription;
    ```

    For each subscription do:

    ```
    ALTER SUBSCRIPTION logical_subscription_01 DISABLE;
    ALTER SUBSCRIPTION logical_subscription_01 SET (slot_name=NONE);
    DROP SUBSCRIPTION logical_subscription_01;
    ```

    Re-run the query above to confirm no subscriptions remain.
  - Drop the publications on the SOURCE (v14). Queries:

    ```
    select * from pg_publication;
    select slot_name, slot_type from pg_replication_slots where slot_type = 'logical';
    ```

    For each publication and its slot do:

    ```
    DROP PUBLICATION IF EXISTS logical_replication_01;
    SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_01';
    ```

    Re-run the queries above to confirm no publications or logical replication slots remain.
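  Since the CR runs with `pg_publication_count=4`, the per-object statements above need to be repeated four times. A hypothetical helper (not part of the playbook, and assuming the zero-padded naming pattern shown above) can print the statements for review before running them; replication slots are best dropped via the `pg_drop_replication_slot` query above rather than by guessing slot names:

  ```
  # Emit cleanup SQL for the four subscriptions/publications; review, then
  # run each block on the cluster named in the comment line.
  for i in 01 02 03 04; do
    echo "-- target (v16):"
    echo "ALTER SUBSCRIPTION logical_subscription_${i} DISABLE;"
    echo "ALTER SUBSCRIPTION logical_subscription_${i} SET (slot_name=NONE);"
    echo "DROP SUBSCRIPTION logical_subscription_${i};"
    echo "-- source (v14):"
    echo "DROP PUBLICATION IF EXISTS logical_replication_${i};"
  done
  ```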
- Check the v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
- Merge https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8721 to destroy the v16 cluster
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Stop/cancel any running playbook executions
- Revert the MR and apply TF to destroy the patroni-main-v16 cluster: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8554
- Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: transactions_primary SLI Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: rails_primary_sql SLI Apdex
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: APDEX < 99.4% soft SLO for more than 10 minutes.
- Metric: Patroni Service Error Ratio
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: patroni Service Apdex
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: APDEX < 99.4% soft SLO for more than 10 minutes.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.