[Production] Refresh Postgres cluster
Production Change
Change Summary
Post-upgrade, we found that we were unable to provision working Postgres 12 replicas. That issue has been addressed, at the cost of temporarily adding a manual step after provisioning.
The configuration deltas have been reduced to those that are expected: `scope`, `name`, `connect_address` (2), and `recovery_conf` (Patroni does the right thing on Postgres 12 with this section). While resolving the Chef issues, we discovered some inconsistencies in the production configuration (for instance, `pg_repack` missing from the recipe, and an incorrect DNS configuration for Consul, which used the Ubuntu 18 setup instead of the Ubuntu 16 setup).
We are therefore ready to roll these changes out to staging and production.
In general, we expect we can converge the configurations without prompting Patroni to reload them (which would in turn potentially restart Postgres). Even so, we will refresh the entire cluster from scratch, which guarantees we run through exactly the same procedure that was tested in the benchmarking environment. This entails draining a replica (say, `01`) and rebuilding it, then performing a switchover to a new primary (see the sketch after the list below), and repeating the provisioning for each remaining replica.
For each replica:
- Drain the replica: runbook
- Re-provision the replica: see #4580 (closed) for how we did the cascade source replica
- Enable traffic to the replica: runbook
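The switchover itself is driven through Patroni. As a minimal sketch only, assuming the hostnames below (the actual leader/candidate pair depends on which node is leader at execution time, and the option is `--master` or `--leader` depending on the Patroni version):

```bash
# Sketch only: promote the rebuilt replica and demote the current leader.
# Hostnames are illustrative; confirm the leader with `sudo gitlab-patronictl list` first.
sudo gitlab-patronictl switchover \
  --master patroni-v12-02-db-gprd.c.gitlab-production.internal \
  --candidate patroni-v12-01-db-gprd.c.gitlab-production.internal
```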
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @ahmadsherif
- Change Criticality - C1
- Change Type - change::scheduled
- Change Reviewer - @alejandro
- Due Date - 2021-06-04 14:00 UTC
- Time tracking - TBD
- Downtime Component - N/A
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Set label change::in-progress on this issue
- Merge and apply https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/59
- In your local shell run: `export counter=09`
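  As a quick illustration of how `$counter` is used in the steps below (the `10#` prefix is only needed if your local shell is bash, where a leading zero would otherwise be read as an octal literal):

  ```bash
  export counter=09
  # The zero-padded value is interpolated into hostnames, e.g.:
  echo "patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  # In bash, prefer base-10 arithmetic when deriving values from it:
  index=$(( 10#$counter - 1 )); echo "$index"   # -> 8
  ```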
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
This list is executed per host; to see the current status, please check the latest comments below.
- Create an alert silence with the following matcher(s):
  - `fqdn=patroni-v12-$counter-db-gprd.c.gitlab-production.internal` (replace `$counter` with the actual zero-padded value)
- Create an alert silence with the following matcher(s):
  - `alertname=PostgreSQL_UnusedReplicationSlot`
  - `slot_name=patroni_v12_$counter_db_gprd_c_gitlab_production_internal` (replace `$counter` with the actual zero-padded value)
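  Silences are normally created through the Alertmanager UI; as a sketch only, assuming `amtool` is available and pointed at the production Alertmanager (the URL below is a placeholder), the same two silences could be created from the command line:

  ```bash
  # Sketch only: matchers mirror the two silence steps above; duration is illustrative.
  AM_URL="https://alertmanager.example.internal"   # placeholder, not the real endpoint
  amtool silence add --alertmanager.url="$AM_URL" -a "$USER" -d 4h -c "CR #4721" \
    fqdn="patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  amtool silence add --alertmanager.url="$AM_URL" -a "$USER" -d 4h -c "CR #4721" \
    alertname="PostgreSQL_UnusedReplicationSlot" \
    slot_name="patroni_v12_${counter}_db_gprd_c_gitlab_production_internal"
  ```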
- Verify that `patroni-v12-02` is still the current leader. If not, adapt the steps below to account for the change in leader.
- `ssh patroni-v12-$counter-db-gprd.c.gitlab-production.internal`
- Take the replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4721"; done`
- Wait until all clients have been disconnected from the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
  - Wait until the output is zero
- Disable Chef: `sudo chef-client-disable "CR #4721"`
- Make sure `chef-client` is not running. If it is, wait until it finishes.
- Remove the pgbouncer Consul services: `sudo rm /etc/consul/conf.d/db-replica*`
- Reload Consul: `sudo systemctl reload consul`
- Make sure there are no clients connected to the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
- In `gitlab-com-infrastructure`:
  - `cd environments/gprd`
  - `index=$((counter-1))`
  - `tf taint "module.patroni-v12.google_compute_instance.instance_with_attached_disk[$index]" && tf taint "module.patroni-v12.google_compute_disk.data_disk[$index]" && tf taint "module.patroni-v12.google_compute_disk.log_disk[$index]"`
  - `tf apply -target=module.patroni-v12`
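  Optionally, before the apply, the plan can be reviewed to confirm that only the tainted instance and its disks will be replaced (a sketch, assuming `tf` here wraps the standard `terraform` CLI):

  ```bash
  # Sketch only: review what the targeted apply would do before running it.
  tf plan -target=module.patroni-v12
  ```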
- If `td-agent` is refusing to start, run: `sudo /opt/td-agent/bin/gem uninstall google-protobuf -v 3.17.2`
- Take the replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4721"; done`
  - We want to control when the replica is added back to Rails load-balancing; otherwise it would be added as soon as it has processed enough WAL segments from GCS.
- Start Patroni: `sudo systemctl start patroni`
- Wait until the replica has caught up to the primary: `sudo gitlab-patronictl list | grep $(hostname -I) | grep running`
- Wait until the replication lag between the replica and the primary has diminished: `while true; do sudo gitlab-patronictl list | grep $(hostname -I) | cut -d'|' -f 7; sleep 180; done`
- Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- `counter=0$((counter-1)); test "${counter}" = "02" && counter=0$((counter-1))`
  - `patroni-v12-02` is the current leader, so we want to skip it and go straight to `patroni-v12-01`
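  For illustration only (not a step to run), this is the host sequence the decrement above walks through, with a `10#` base prefix added so the zero-padded values are safe in bash arithmetic:

  ```bash
  # Illustration: prints 08 07 06 05 04 03 01, skipping 02 (the current leader).
  counter=09
  while [ "$counter" != "01" ]; do
    counter=0$(( 10#$counter - 1 ))
    [ "$counter" = "02" ] && counter=0$(( 10#$counter - 1 ))
    echo "$counter"
  done
  ```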
- Expire the two alert silences created at the beginning
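  As a sketch only, again assuming `amtool` is configured against the production Alertmanager, the silence IDs can be looked up and expired from the command line:

  ```bash
  # Sketch only: find the silence IDs created earlier, then expire them.
  amtool silence query --alertmanager.url="$AM_URL" \
    fqdn="patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  amtool silence query --alertmanager.url="$AM_URL" alertname="PostgreSQL_UnusedReplicationSlot"
  amtool silence expire --alertmanager.url="$AM_URL" <silence-id> <silence-id>
  ```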
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Run `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`; the number should increase gradually
- Remove alert silences with the following matcher(s):
  - `fqdn=patroni-v12-$counter-db-gprd.c.gitlab-production.internal` (replace `$counter` with the actual zero-padded value)
-
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
Scenario 1: Before running `tf apply`:
- Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- Do the verification step(s) above

Scenario 2: After running `tf apply`: there are no viable rollback steps; we have to go forward with the change.
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- There are currently no active incidents.