[GPRD][PG16 Upgrade] - Test multi-slot PostgreSQL logical replication upgrade playbook with Production workload for 24 hours
Production Change
Change Summary
Last year, during the PG14 upgrade, we found that single-slot logical replication couldn't handle our MAIN database workload because WAL decoding is limited to a single process.
This year we are planning to use multi-slot logical replication for the PG16 upgrade. For that reason, we need to test the upgrade playbook with 4-slot logical replication and monitor it for 24 hours in the Production environment, to observe whether the multi-slot solution can handle the workload and what kinds of data inconsistencies and errors might occur in the standby cluster.
The idea of multi-slot logical replication is to let us start the upgrade as far ahead of the switchover as possible, which should reduce the maintenance window time and risk, and, more importantly, give us a longer rollback period (ideally multiple days), reducing the risk in case of a performance regression on version 16.
Note: with a single slot we could only allow a rollback period of a few hours.
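For context, a minimal sketch of what a 4-slot setup looks like in plain SQL (the `upgrade.yml` playbook automates this; the object names follow the fallback steps later in this plan, while the split of tables across publications and the connection string are assumptions for illustration):

```bash
# Sketch only -- the playbook creates these objects; options such as
# copy_data will differ in the real run.

# On the v14 source: one publication (and hence one slot) per table subset,
# so WAL decoding is spread across 4 walsender processes.
gitlab-psql -c "CREATE PUBLICATION logical_replication_01 FOR TABLE users, projects;"
gitlab-psql -c "CREATE PUBLICATION logical_replication_02 FOR TABLE merge_requests;"
# ... publications 03 and 04 cover the remaining tables ...

# On the v16 target: one subscription per publication, each with its own slot.
gitlab-psql -c "CREATE SUBSCRIPTION logical_subscription_01
  CONNECTION 'host=SOURCE_HOST dbname=gitlabhq_production'
  PUBLICATION logical_replication_01
  WITH (slot_name = 'logical_replication_slot_01');"
```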
Objectives
- Measure playbook step timing with real production data
- Evaluate multi-slot logical replication for 24 hours (including a peak period)
- Analyze metrics and replication lag
- Analyze errors and data consistency issues
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @bshah11 @vitabaks
- Time tracking - 4 days
- Downtime Component - no downtime (database migrations need to be disabled for 24 hours)
PCL - Production Change Lock
As discussed in https://gitlab.slack.com/archives/CCFV016SV/p1716512591351799, we changed the CR execution window, and we also need to define a Soft PCL to block "delivery" during the period.
- MR Soft PCL during CR 18049 - change-lock!42 (merged)
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
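If a change does require maintenance mode, one hedged option is the public application settings API (assuming an admin token in `$GITLAB_ADMIN_TOKEN`; the runbooks may prescribe a different method):

```bash
# Sketch: set maintenance mode (read-only) via the application settings API.
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://gitlab.com/api/v4/application/settings?maintenance_mode=true"

# Unset it once the maintenance window is over.
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://gitlab.com/api/v4/application/settings?maintenance_mode=false"
```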
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete - 3 days
T minus 24 hours (2024-05-29 00:00 UTC) - Launch `patroni-main-v16` single-node cluster

- Set label ~change::in-progress: `/label ~change::in-progress`
- Create a silence at https://alerts.gitlab.net for node `fqdn=~"patroni-main-v16.*"` and `env="gprd"` with a duration of 5 days
- Apply TF on MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8509
- Check that the consul service endpoint `db-replica-v16.service.consul.` points to v16 nodes, and that `db-replica.service.consul.` points to v14 replica nodes:

  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 db-replica-v16.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni-v16.service.consul. SRV +short
  ```

- Follow the runbook steps to sync the standby cluster: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/build_cluster_from_snapshot.md#52-stop-patroni-and-reset-wal-directory-from-old-files
- Set label ~change::scheduled: `/label ~change::scheduled`
Prepare console VM (if not done yet)

- SSH to the console VM in gprd
- Configure the dbupgrade user:
  - Disable screen sharing to reduce the risk of exposing the private key
  - Change to user dbupgrade: `sudo su - dbupgrade`
  - Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
  - `chmod 600 ~/.ssh/id_dbupgrade`
  - Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
  - Repeat the same steps on the target leader (it also has to have the private key)
  - Re-enable screen sharing if beneficial
- Start or resume the tmux session: `tmux a -t pg14 || tmux new -s pg14`
- Create an access token with at least the `read_repository` scope for the next step
- Clone the repos:

  ```
  rm -rf ~/src \
    && mkdir ~/src \
    && cd ~/src \
    && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
    && cd db-migration \
    && git checkout latest_stable
  ```

- Ensure you have the prerequisites installed: `sudo apt install ansible`
- Ensure that Ansible can talk to all the hosts in gprd-main:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
  ```

- In advance, run the pre-checks and upgrade-check, and pre-install packages, to ensure that everything is ready for the future upgrade:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" \
    --tags "pre-checks, packages, upgrade-check" 2>&1 \
    | ts | tee -a ansible_upgrade_pre_checks_gprd_main_$(date +%Y%m%d).log
  ```

- Refresh tmux command and shortcut knowledge: https://tmuxcheatsheet.com/. To leave tmux without stopping it, use the sequence Ctrl-b, Ctrl-z.
T ZERO TEST DAY (2024-05-30 00:00 UTC) - Run the `upgrade` playbook in the `patroni-main-v16` cluster to upgrade the v16 cluster and configure multi-slot logical replication
- Reach out to the EOC in the #production channel and request a green light
- Set label ~change::in-progress: `/label ~change::in-progress`
- Disable the DDL-related feature flags:
  - Disable feature flags by typing the following into #production: `/chatops run feature set disallow_database_ddl_feature_flags true`
  - Inform #g_delivery, #production_engineering, and #g_database that DDL feature flags have been disabled until the CR is complete:

    > Hi @db-team, @release-managers @sre-oncall, please note that we have disabled DDLs in our Production environment, including `execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features, and partition_manager_sync_partitions tasks, as we are carrying out the previously discussed test for database logical replication between `2024-05-29 00:00 UTC` and `2024-05-30 00:00 UTC`. This will also affect any deployments that require changes to the database model. We should re-enable the feature flag in 24 hours, once the test is complete. Thank you for your patience.

  - On Slack in #production, verify: `/chatops run feature get disallow_database_ddl_feature_flags`
- Check for connections from users other than `gitlab`:

  ```
  gitlab-psql -c "
  select pid, client_addr, usename, application_name, backend_type,
         clock_timestamp() - backend_start as connected_ago, state,
         left(query, 200) as query
    from pg_stat_activity
   where pid <> pg_backend_pid()
     and usename <> 'gitlab'
     and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
     and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
     and application_name <> 'Patroni'
  "
  ```

- Wait for any DDL or large transaction to finish (this can take several minutes, up to hours); the sketch below can help spot them
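A minimal sketch for spotting those long-running transactions on the source (the 5-minute threshold is an assumption; tune as needed):

```bash
# Sketch: list transactions open for more than 5 minutes on the v14 source.
gitlab-psql -c "
select pid, usename, state,
       clock_timestamp() - xact_start as xact_age,
       left(query, 120) as query
  from pg_stat_activity
 where xact_start is not null
   and clock_timestamp() - xact_start > interval '5 minutes'
 order by xact_start
"
```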
- Run the Ansible playbook for upgrading the gprd-main cluster:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_main_$(date +%Y%m%d).log
  ```
- Check logical replication lag and wait for the clusters to get in sync - PG Upgrade Dashboard (a lag query sketch follows this list)
- Check v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
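For eyeballing per-slot lag from the v14 source, a minimal sketch (the PG Upgrade Dashboard remains the primary source of truth; the `slot_type` filter assumes the upgrade slots are the only logical slots present):

```bash
# Sketch: bytes of WAL each logical slot still has to replay, per slot.
gitlab-psql -c "
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
 where slot_type = 'logical'
 order by slot_name
"
```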
- Update the rollback steps
- Keep label ~change::in-progress: `/label ~change::in-progress`
- Notify DBRE, DB Backend Engineers, and SRE counterparts that the test is in progress
T plus 24 hours (2024-05-31 00:00 UTC) - Stop the multi-slot replication and gather data
- Set label ~change::in-progress: `/label ~change::in-progress`
- Run the Ansible playbook to stop the logical replication:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    stop_upgrade_replication.yml -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a stop_replication_gprd_main_$(date +%Y%m%d).log
  ```
- If it doesn't work, drop the objects manually (a loop sketch follows these steps):
  - Drop the subscriptions on the TARGET (v16):
    - Query: `select * from pg_stat_subscription;`
    - For each subscription slot, do:

      ```
      ALTER SUBSCRIPTION logical_subscription_01 DISABLE;
      ALTER SUBSCRIPTION logical_subscription_01 SET (slot_name = NONE);
      DROP SUBSCRIPTION logical_subscription_01;
      ```

    - Query again to confirm the subscriptions are gone: `select * from pg_stat_subscription;`
  - Drop the publications on the SOURCE (v14):
    - Query: `select * from pg_publication;`
    - Query: `select slot_name, slot_type from pg_replication_slots where slot_type = 'logical';`
    - For each publication and its slot, do:

      ```
      DROP PUBLICATION IF EXISTS logical_replication_01;
      SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_01';
      ```

    - Query again to confirm the publications and slots are gone: `select * from pg_publication;`
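A hedged sketch of that fallback as a loop over all four slots, assuming the numbering shown above (verify the names against `pg_stat_subscription` and `pg_publication` before running):

```bash
# Sketch only: apply the manual fallback to all four subscriptions/publications.

# On the v16 TARGET: detach each subscription from its slot, then drop it.
for i in 01 02 03 04; do
  gitlab-psql -c "ALTER SUBSCRIPTION logical_subscription_${i} DISABLE;"
  gitlab-psql -c "ALTER SUBSCRIPTION logical_subscription_${i} SET (slot_name = NONE);"
  gitlab-psql -c "DROP SUBSCRIPTION logical_subscription_${i};"
done

# On the v14 SOURCE: drop each publication and its replication slot.
for i in 01 02 03 04; do
  gitlab-psql -c "DROP PUBLICATION IF EXISTS logical_replication_${i};"
  gitlab-psql -c "SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_${i}';"
done
```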
- Check v14 and v16 writer nodes (a consistency spot-check sketch follows this list):
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
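Since analyzing data consistency issues is one of the objectives, a minimal spot-check sketch, to be run against both the v14 source and the v16 target and diffed (the table list is an assumption for illustration; counts only match once replication has fully drained):

```bash
# Sketch: compare row counts for a few spot-check tables on both clusters.
for t in users projects namespaces; do
  echo -n "${t}: "
  gitlab-psql -t -c "select count(*) from ${t};"
done
```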
- Re-enable the DDL-related feature flags:
  - Re-enable them by typing the following into #production: `/chatops run feature set disallow_database_ddl_feature_flags false`
  - Inform the database team and Release Managers in #g_delivery and #g_database that DDL feature flags are re-enabled:

    > Hi @db-team, @release-managers @sre-oncall, FYI, we just finished the test for database logical replication. We have re-enabled DDL feature flags in Production, and any restrictions on database model changes have been lifted. Thanks for your patience!

  - On Slack in #production, verify: `/chatops run feature get disallow_database_ddl_feature_flags`
- Notify DBRE, DB Backend Engineers, and SRE counterparts that the test is finished
- Revert the MR change-lock!42 (merged) to remove the PCL from the schedule
- Create a revert MR of https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8509 to destroy the v16 cluster
- Set label ~change::scheduled: `/label ~change::scheduled`
T plus 48 hours (2024-06-01 00:00 UTC) - Destroy the `patroni-main-v16` single-node cluster
- Set label ~change::in-progress: `/label ~change::in-progress`
- Apply TF in the MR: TODO Destroy MR
- Remove the silence at https://alerts.gitlab.net for node `fqdn=~"patroni-main-v16.*"` and `env="gprd"`
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 30 minutes
- Revert the MR and apply TF to destroy the `patroni-main-v16` cluster: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8549
- Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: transactions_primary SLI Error Ratio
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: rails_primary_sql SLI Apdex
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: Apdex < 99.4% soft SLO for more than 10 minutes.
- Metric: Patroni Service Error Ratio
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: patroni Service Apdex
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: Apdex < 99.4% soft SLO for more than 10 minutes.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and the results noted in a comment on this issue.
  - A dry-run has been conducted and the results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents with severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.