[Production] Refresh Postgres cluster
Production Change
Change Summary
Post-upgrade, we found that we were unable to provision working Postgres 12 replicas. That issue has been addressed, at the cost of temporarily adding a manual step after provisioning.
The configuration deltas have been reduced to those that are expected: `scope`, `name`, `connect_address` (2), and `recovery_conf` (Patroni does the right thing on Postgres 12 with this section). While resolving the Chef issues, we discovered some inconsistencies in the production configuration (for instance, `pg_repack` missing from the recipe, and an incorrect DNS configuration for Consul, which used the Ubuntu 18 setup instead of the Ubuntu 16 setup).
We are therefore ready to roll these changes out to staging and production.
In general, we expect we can converge the configurations without prompting Patroni to reload them (which would in turn potentially restart Postgres). Even so, we will refresh the entire cluster from scratch, which guarantees we run through exactly the same procedure that was tested in the benchmarking environment. This entails draining a replica (say, `01`) and rebuilding it, then performing a switchover to a new primary (see the sketch after the list below), and repeating the provisioning for each remaining replica.
For each replica:
- Drain the replica: runbook
- Re-provision the replica: see #4580 (closed) for how we did the cascade source replica
- Enable traffic to the replica: runbook
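The switchover itself is driven through Patroni. As a minimal sketch only, assuming the hostnames below (the actual leader/candidate pair depends on which node is leader at execution time, and the option is `--master` or `--leader` depending on the Patroni version):

```bash
# Sketch only: promote the rebuilt replica and demote the current leader.
# Hostnames are illustrative; confirm the leader with `sudo gitlab-patronictl list` first.
sudo gitlab-patronictl switchover \
  --master patroni-v12-02-db-gprd.c.gitlab-production.internal \
  --candidate patroni-v12-01-db-gprd.c.gitlab-production.internal
```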
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @ahmadsherif
- Change Criticality - C1
- Change Type - change::scheduled
- Change Reviewer - @alejandro
- Due Date - 2021-06-04 14:00 UTC
- Time tracking - TBD
- Downtime Component - N/A
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Set label change::in-progress on this issue
- Merge and apply https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/59
- In your local shell run: `export counter=09`
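  As a quick illustration of how `$counter` is used in the steps below (the `10#` prefix is only needed if your local shell is bash, where a leading zero would otherwise be read as an octal literal):

  ```bash
  export counter=09
  # The zero-padded value is interpolated into hostnames, e.g.:
  echo "patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  # In bash, prefer base-10 arithmetic when deriving values from it:
  index=$(( 10#$counter - 1 )); echo "$index"   # -> 8
  ```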
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
This list is executed per host; to see the current status, please check the latest comments below.
- Create an alert silence with the following matcher(s):
  - `fqdn=patroni-v12-$counter-db-gprd.c.gitlab-production.internal` (replace `$counter` with the actual zero-padded value)
- Create an alert silence with the following matcher(s):
  - `alertname=PostgreSQL_UnusedReplicationSlot`
  - `slot_name=patroni_v12_$counter_db_gprd_c_gitlab_production_internal` (replace `$counter` with the actual zero-padded value)
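  Silences are normally created through the Alertmanager UI; as a sketch only, assuming `amtool` is available and pointed at the production Alertmanager (the URL below is a placeholder), the same two silences could be created from the command line:

  ```bash
  # Sketch only: matchers mirror the two silence steps above; duration is illustrative.
  AM_URL="https://alertmanager.example.internal"   # placeholder, not the real endpoint
  amtool silence add --alertmanager.url="$AM_URL" -a "$USER" -d 4h -c "CR #4721" \
    fqdn="patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  amtool silence add --alertmanager.url="$AM_URL" -a "$USER" -d 4h -c "CR #4721" \
    alertname="PostgreSQL_UnusedReplicationSlot" \
    slot_name="patroni_v12_${counter}_db_gprd_c_gitlab_production_internal"
  ```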
- Verify that `patroni-v12-02` is still the current leader. If not, adapt the steps below to account for the change in leader.
- `ssh patroni-v12-$counter-db-gprd.c.gitlab-production.internal`
- Take the replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4721"; done`
- Wait until all clients have been disconnected from the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
  - Wait until the output is zero
- Disable Chef: `sudo chef-client-disable "CR #4721"`
- Make sure `chef-client` is not running. If it is, wait until it finishes.
- Remove the pgbouncer Consul services: `sudo rm /etc/consul/conf.d/db-replica*`
- Reload Consul: `sudo systemctl reload consul`
- Make sure there are no clients connected to the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
- In `gitlab-com-infrastructure`:
  - `cd environments/gprd`
  - `index=$((counter-1))`
  - `tf taint "module.patroni-v12.google_compute_instance.instance_with_attached_disk[$index]" && tf taint "module.patroni-v12.google_compute_disk.data_disk[$index]" && tf taint "module.patroni-v12.google_compute_disk.log_disk[$index]"`
  - `tf apply -target=module.patroni-v12`
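  Optionally, before the apply, the plan can be reviewed to confirm that only the tainted instance and its disks will be replaced (a sketch, assuming `tf` here wraps the standard `terraform` CLI):

  ```bash
  # Sketch only: review what the targeted apply would do before running it.
  tf plan -target=module.patroni-v12
  ```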
- If `td-agent` is refusing to start, run: `sudo /opt/td-agent/bin/gem uninstall google-protobuf -v 3.17.2`
- Take the replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4721"; done`
  - We want to control when the replica is added back to Rails load-balancing; otherwise it would be added as soon as it has processed enough WAL segments from GCS.
- Start Patroni: `sudo systemctl start patroni`
- Wait until the replica has caught up to the primary: `sudo gitlab-patronictl list | grep $(hostname -I) | grep running`
- Wait until the replication lag between the replica and the primary has diminished: `while true; do sudo gitlab-patronictl list | grep $(hostname -I) | cut -d'|' -f 7; sleep 180; done`
- Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- `counter=0$((counter-1)); test "${counter}" = "02" && counter=0$((counter-1))`
  - `patroni-v12-02` is the current leader, so we want to skip it and go straight to `patroni-v12-01`
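  For illustration only (not a step to run), this is the host sequence the decrement above walks through, with a `10#` base prefix added so the zero-padded values are safe in bash arithmetic:

  ```bash
  # Illustration: prints 08 07 06 05 04 03 01, skipping 02 (the current leader).
  counter=09
  while [ "$counter" != "01" ]; do
    counter=0$(( 10#$counter - 1 ))
    [ "$counter" = "02" ] && counter=0$(( 10#$counter - 1 ))
    echo "$counter"
  done
  ```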
- Expire the two alert silences created at the beginning
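  As a sketch only, again assuming `amtool` is configured against the production Alertmanager, the silence IDs can be looked up and expired from the command line:

  ```bash
  # Sketch only: find the silence IDs created earlier, then expire them.
  amtool silence query --alertmanager.url="$AM_URL" \
    fqdn="patroni-v12-${counter}-db-gprd.c.gitlab-production.internal"
  amtool silence query --alertmanager.url="$AM_URL" alertname="PostgreSQL_UnusedReplicationSlot"
  amtool silence expire --alertmanager.url="$AM_URL" <silence-id> <silence-id>
  ```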
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Run `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`; the number should increase gradually
- Remove alert silences with the following matcher(s):
  - `fqdn=patroni-v12-$counter-db-gprd.c.gitlab-production.internal` (replace `$counter` with the actual zero-padded value)
-
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
Scenario 1: Before running `tf apply`:
- Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- Do the verification step(s) above

Scenario 2: After running `tf apply`: there are no viable rollback steps; we have to go forward with the change.
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- There are currently no active incidents.