[GPRD][PG16 Upgrade] - Test multi-slot PostgreSQL logical replication upgrade playbook with Production workload for 24 hours
Production Change
Change Summary
Last year, during the PG14 upgrade, we found that single-slot logical replication couldn't handle our MAIN database workload because WAL decoding is limited to a single process.
This year we are planning to use multi-slot logical replication for the PG16 upgrade. For that reason, we need to test the upgrade playbook with 4-slot logical replication and monitor it for 24 hours in the Production environment, to observe whether the multi-slot solution can handle the workload and what kinds of data inconsistencies and errors might occur in the standby cluster.
The idea of multi-slot logical replication is to let us start the upgrade as far ahead of the switchover as possible, which should reduce the maintenance window time and risk, and, more importantly, give us a longer rollback period (ideally multiple days), reducing the risk in case of a performance regression on version 16.
Note: with a single slot we could only allow a rollback period of a few hours.
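For context, a minimal sketch of what a 4-slot setup looks like in plain SQL (the `upgrade.yml` playbook automates this; the object names follow the fallback steps later in this plan, while the split of tables across publications and the connection string are assumptions for illustration):

```bash
# Sketch only -- the playbook creates these objects; options such as
# copy_data will differ in the real run.

# On the v14 source: one publication (and hence one slot) per table subset,
# so WAL decoding is spread across 4 walsender processes.
gitlab-psql -c "CREATE PUBLICATION logical_replication_01 FOR TABLE users, projects;"
gitlab-psql -c "CREATE PUBLICATION logical_replication_02 FOR TABLE merge_requests;"
# ... publications 03 and 04 cover the remaining tables ...

# On the v16 target: one subscription per publication, each with its own slot.
gitlab-psql -c "CREATE SUBSCRIPTION logical_subscription_01
  CONNECTION 'host=SOURCE_HOST dbname=gitlabhq_production'
  PUBLICATION logical_replication_01
  WITH (slot_name = 'logical_replication_slot_01');"
```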
Objectives
- Measure playbook step timing with real production data
- Evaluate multi-slot logical replication for 24 hours (including a peak period)
- Analyze metrics and replication lag
- Analyze errors and data consistency issues
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @bshah11 @vitabaks
- Time tracking - 4 days
- Downtime Component - no downtime (database migrations need to be disabled for 24 hours)
PCL - Production Change Lock
As discussed in https://gitlab.slack.com/archives/CCFV016SV/p1716512591351799, we changed the CR execution window, and we also need to define a Soft PCL to block "delivery" during the period.
- MR Soft PCL during CR 18049 - change-lock!42 (merged)
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
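If a change does require maintenance mode, one hedged option is the public application settings API (assuming an admin token in `$GITLAB_ADMIN_TOKEN`; the runbooks may prescribe a different method):

```bash
# Sketch: set maintenance mode (read-only) via the application settings API.
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://gitlab.com/api/v4/application/settings?maintenance_mode=true"

# Unset it once the maintenance window is over.
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://gitlab.com/api/v4/application/settings?maintenance_mode=false"
```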
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete - 3 days
T minus 24 hours (2024-05-29 00:00 UTC) - Launch `patroni-main-v16` single-node cluster

- Set label ~change::in-progress: `/label ~change::in-progress`
- Create a silence at https://alerts.gitlab.net for node `fqdn=~"patroni-main-v16.*"` and `env="gprd"` with a duration of 5 days
- Apply TF on MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8509
- Check that the consul service endpoint `db-replica-v16.service.consul.` points to v16 nodes, and that `db-replica.service.consul.` points to v14 replica nodes:

  ```
  dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 db-replica-v16.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni.service.consul. SRV +short
  dig @127.0.0.1 -p 8600 master.patroni-v16.service.consul. SRV +short
  ```

- Follow the runbook steps to sync the standby cluster: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/build_cluster_from_snapshot.md#52-stop-patroni-and-reset-wal-directory-from-old-files
- Set label ~change::scheduled: `/label ~change::scheduled`
Prepare console VM (if not done yet)

- SSH to the console VM in gprd
- Configure the dbupgrade user:
  - Disable screen sharing to reduce the risk of exposing the private key
  - Change to user dbupgrade: `sudo su - dbupgrade`
  - Copy the dbupgrade user's private key from 1Password to `~/.ssh/id_dbupgrade`
  - `chmod 600 ~/.ssh/id_dbupgrade`
  - Use the key as default: `ln -s /home/dbupgrade/.ssh/id_dbupgrade /home/dbupgrade/.ssh/id_rsa`
  - Repeat the same steps on the target leader (it also has to have the private key)
  - Re-enable screen sharing if beneficial
- Start or resume the tmux session: `tmux a -t pg14 || tmux new -s pg14`
- Create an access token with at least the `read_repository` scope for the next step
- Clone the repos:

  ```
  rm -rf ~/src \
    && mkdir ~/src \
    && cd ~/src \
    && git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git \
    && cd db-migration \
    && git checkout latest_stable
  ```

- Ensure you have the prerequisites installed: `sudo apt install ansible`
- Ensure that Ansible can talk to all the hosts in gprd-main:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" -i inventory/gprd-main.yml all -m ping
  ```

- In advance, run the pre-checks and upgrade-check, and pre-install packages, to ensure that everything is ready for the future upgrade:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" \
    --tags "pre-checks, packages, upgrade-check" 2>&1 \
    | ts | tee -a ansible_upgrade_pre_checks_gprd_main_$(date +%Y%m%d).log
  ```

- Refresh tmux command and shortcut knowledge: https://tmuxcheatsheet.com/. To leave tmux without stopping it, use the sequence Ctrl-b, Ctrl-z.
T ZERO TEST DAY (2024-05-30 00:00 UTC) - Run the `upgrade` playbook in the `patroni-main-v16` cluster to upgrade the v16 cluster and configure multi-slot logical replication
- Reach out to the EOC in the #production channel and request a green light
- Set label ~change::in-progress: `/label ~change::in-progress`
- Disable the DDL-related feature flags:
  - Disable feature flags by typing the following into #production: `/chatops run feature set disallow_database_ddl_feature_flags true`
  - Inform #g_delivery, #production_engineering, and #g_database that DDL feature flags have been disabled until the CR is complete:

    > Hi @db-team, @release-managers @sre-oncall, please note that we have disabled DDLs in our Production environment, including `execute_batched_migrations_on_schedule` and `execute_background_migrations`, reindexing, async_foreign_key, async_index features, and partition_manager_sync_partitions tasks, as we are carrying out the previously discussed test for database logical replication between `2024-05-29 00:00 UTC` and `2024-05-30 00:00 UTC`. This will also affect any deployments that require changes to the database model. We should re-enable the feature flag in 24 hours, once the test is complete. Thank you for your patience.

  - On Slack in #production, verify: `/chatops run feature get disallow_database_ddl_feature_flags`
- Check for connections from users other than `gitlab`:

  ```
  gitlab-psql -c "
  select pid, client_addr, usename, application_name, backend_type,
         clock_timestamp() - backend_start as connected_ago, state,
         left(query, 200) as query
    from pg_stat_activity
   where pid <> pg_backend_pid()
     and usename <> 'gitlab'
     and not backend_type ~ '(walsender|logical replication|pg_wait_sampling)'
     and usename not in ('pgbouncer', 'postgres_exporter', 'gitlab-consul')
     and application_name <> 'Patroni'
  "
  ```

- Wait for any DDL or large transaction to finish (this can take several minutes, up to hours); the sketch below can help spot them
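A minimal sketch for spotting those long-running transactions on the source (the 5-minute threshold is an assumption; tune as needed):

```bash
# Sketch: list transactions open for more than 5 minutes on the v14 source.
gitlab-psql -c "
select pid, usename, state,
       clock_timestamp() - xact_start as xact_age,
       left(query, 120) as query
  from pg_stat_activity
 where xact_start is not null
   and clock_timestamp() - xact_start > interval '5 minutes'
 order by xact_start
"
```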
- Run the Ansible playbook for upgrading the gprd-main cluster:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    upgrade.yml -e "pg_old_version=14 pg_new_version=16" -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a ansible_upgrade_gprd_main_$(date +%Y%m%d).log
  ```
- Check logical replication lag and wait for the clusters to get in sync - PG Upgrade Dashboard (a lag query sketch follows this list)
- Check v14 and v16 writer nodes:
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
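For eyeballing per-slot lag from the v14 source, a minimal sketch (the PG Upgrade Dashboard remains the primary source of truth; the `slot_type` filter assumes the upgrade slots are the only logical slots present):

```bash
# Sketch: bytes of WAL each logical slot still has to replay, per slot.
gitlab-psql -c "
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
  from pg_replication_slots
 where slot_type = 'logical'
 order by slot_name
"
```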
- Update the rollback steps
- Keep label ~change::in-progress: `/label ~change::in-progress`
- Notify DBRE, DB Backend Engineers, and SRE counterparts that the test is in progress
T plus 24 hours (2024-05-31 00:00 UTC) - Stop the multi-slot replication and gather data
- Set label ~change::in-progress: `/label ~change::in-progress`
- Run the Ansible playbook to stop the logical replication:

  ```
  cd ~/src/db-migration/pg-upgrade-logical
  ansible-playbook \
    -e "ansible_ssh_private_key_file=/home/dbupgrade/.ssh/id_dbupgrade" \
    -i inventory/gprd-main.yml \
    stop_upgrade_replication.yml -e "pg_publication_count=4" 2>&1 \
    | ts | tee -a stop_replication_gprd_main_$(date +%Y%m%d).log
  ```
- If it doesn't work, drop the objects manually (a loop sketch follows these steps):
  - Drop the subscriptions on the TARGET (v16):
    - Query: `select * from pg_stat_subscription;`
    - For each subscription slot, do:

      ```
      ALTER SUBSCRIPTION logical_subscription_01 DISABLE;
      ALTER SUBSCRIPTION logical_subscription_01 SET (slot_name = NONE);
      DROP SUBSCRIPTION logical_subscription_01;
      ```

    - Query again to confirm the subscriptions are gone: `select * from pg_stat_subscription;`
  - Drop the publications on the SOURCE (v14):
    - Query: `select * from pg_publication;`
    - Query: `select slot_name, slot_type from pg_replication_slots where slot_type = 'logical';`
    - For each publication and its slot, do:

      ```
      DROP PUBLICATION IF EXISTS logical_replication_01;
      SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_01';
      ```

    - Query again to confirm the publications and slots are gone: `select * from pg_publication;`
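A hedged sketch of that fallback as a loop over all four slots, assuming the numbering shown above (verify the names against `pg_stat_subscription` and `pg_publication` before running):

```bash
# Sketch only: apply the manual fallback to all four subscriptions/publications.

# On the v16 TARGET: detach each subscription from its slot, then drop it.
for i in 01 02 03 04; do
  gitlab-psql -c "ALTER SUBSCRIPTION logical_subscription_${i} DISABLE;"
  gitlab-psql -c "ALTER SUBSCRIPTION logical_subscription_${i} SET (slot_name = NONE);"
  gitlab-psql -c "DROP SUBSCRIPTION logical_subscription_${i};"
done

# On the v14 SOURCE: drop each publication and its replication slot.
for i in 01 02 03 04; do
  gitlab-psql -c "DROP PUBLICATION IF EXISTS logical_replication_${i};"
  gitlab-psql -c "SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE slot_name = 'logical_replication_slot_${i}';"
done
```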
- Check v14 and v16 writer nodes (a consistency spot-check sketch follows this list):
  - Check `gitlab-patronictl list`
  - Check `select * from pg_stat_replication;`
  - Check `select * from pg_replication_slots;`
  - Check the PostgreSQL log files
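Since analyzing data consistency issues is one of the objectives, a minimal spot-check sketch, to be run against both the v14 source and the v16 target and diffed (the table list is an assumption for illustration; counts only match once replication has fully drained):

```bash
# Sketch: compare row counts for a few spot-check tables on both clusters.
for t in users projects namespaces; do
  echo -n "${t}: "
  gitlab-psql -t -c "select count(*) from ${t};"
done
```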
- Re-enable the DDL-related feature flags:
  - Re-enable them by typing the following into #production: `/chatops run feature set disallow_database_ddl_feature_flags false`
  - Inform the database team and Release Managers in #g_delivery and #g_database that DDL feature flags are re-enabled:

    > Hi @db-team, @release-managers @sre-oncall, FYI, we just finished the test for database logical replication. We have re-enabled DDL feature flags in Production, and any restrictions on database model changes have been lifted. Thanks for your patience!

  - On Slack in #production, verify: `/chatops run feature get disallow_database_ddl_feature_flags`
- Notify DBRE, DB Backend Engineers, and SRE counterparts that the test is finished
- Revert the MR change-lock!42 (merged) to remove the PCL from the schedule
- Create a revert MR of https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8509 to destroy the v16 cluster
- Set label ~change::scheduled: `/label ~change::scheduled`
T plus 48 hours (2024-06-01 00:00 UTC) - Destroy the `patroni-main-v16` single-node cluster
- Set label ~change::in-progress: `/label ~change::in-progress`
- Apply TF in the MR: TODO Destroy MR
- Remove the silence at https://alerts.gitlab.net for node `fqdn=~"patroni-main-v16.*"` and `env="gprd"`
- Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 30 minutes
- Revert the MR and apply TF to destroy the `patroni-main-v16` cluster: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8549
- Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: transactions_primary SLI Error Ratio
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: rails_primary_sql SLI Apdex
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: Apdex < 99.4% soft SLO for more than 10 minutes.
- Metric: Patroni Service Error Ratio
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: error ratio > 0.6% soft SLO for more than 10 minutes.
- Metric: patroni Service Apdex
  - Location: Dashboard URL
  - What changes to this metric should prompt a rollback: Apdex < 99.4% soft SLO for more than 10 minutes.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and the results noted in a comment on this issue.
  - A dry-run has been conducted and the results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents with severity1 or severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.