[Production] Switchover to new patroni leader and refresh old one
Production Change
Change Summary
After #4721 (closed) we'll have refreshed all of our Patroni cluster nodes except the leader. In this CR we'll perform a switchover to a new leader in order to refresh the old one.
Change Details
- Services Impacted - ~"Service::Patroni"
- Change Technician - @ahmadsherif
- Change Criticality - C1
- Change Type - ~"change::unscheduled"
- Change Reviewer - @alejandro
- Due Date - 2021-07-10 9:00 UTC
- Time tracking - ~1 hour
- Downtime Component - up to 5 mins
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Disable automatic database reindexing via Slack chatops: `/chatops run feature set database_reindexing false`
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Verify that `patroni-v12-02` is still the current leader (see the sketch below). If not, adjust the procedure below accordingly.
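  A quick way to confirm this, using the same `gitlab-patronictl` wrapper that appears in the change steps below:

  ```shell
  # Run from any node of the patroni-v12 cluster; the member whose
  # role is "Leader" should be patroni-v12-02.
  ssh patroni-v12-02-db-gprd.c.gitlab-production.internal
  sudo gitlab-patronictl list | grep Leader
  ```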
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Create an alert silence with the following matcher(s):
  - `fqdn` =~ `patroni-v12-(02|05)-db-gprd.c.gitlab-production.internal` (check "Regex")
- [ ] Create an alert silence with the following matcher(s):
  - `alertname` = `PostgreSQL_UnusedReplicationSlot`
  - `slot_name` = `patroni_v12_02_db_gprd_c_gitlab_production_internal`
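  These silences are normally created in the Alertmanager UI; as an alternative sketch, assuming `amtool` is available and pointed at the right Alertmanager (the URL below is a placeholder), the same matchers could be applied from the CLI:

  ```shell
  # Silence host-level alerts for both nodes (regex matcher) and the
  # unused-replication-slot alert for the old leader. The duration is
  # arbitrary; size it to the change window.
  amtool --alertmanager.url=http://localhost:9093 silence add \
    --comment="CR #4781" --duration=3h \
    'fqdn=~"patroni-v12-(02|05)-db-gprd.c.gitlab-production.internal"'
  amtool --alertmanager.url=http://localhost:9093 silence add \
    --comment="CR #4781" --duration=3h \
    alertname=PostgreSQL_UnusedReplicationSlot \
    slot_name=patroni_v12_02_db_gprd_c_gitlab_production_internal
  ```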
- [ ] Switchover to a new primary:
  - `ssh patroni-v12-02-db-gprd.c.gitlab-production.internal`
  - `sudo gitlab-patronictl switchover --master patroni-v12-02-db-gprd.c.gitlab-production.internal --candidate patroni-v12-05-db-gprd.c.gitlab-production.internal`
- [ ] Make sure a new leader has been elected and all replicas are following it: `sudo gitlab-patronictl list | grep Leader | grep running`
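  To also confirm the replicas, it may help to inspect the full cluster view (every member should be in a running state with little or no lag):

  ```shell
  sudo gitlab-patronictl list
  ```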
- [ ] Take the now-replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4781"; done`
- [ ] Wait until all clients have been disconnected from the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
  - Wait until the output is zero
- [ ] Disable Chef: `sudo chef-client-disable "CR #4781"`
- [ ] Make sure `chef-client` is not running. If it is, wait until it finishes.
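  A minimal way to check (the exact process name is an assumption based on the standard chef-client daemon):

  ```shell
  # No output means no chef-client run is currently in progress.
  pgrep -fa chef-client
  ```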
- [ ] Remove the pgbouncer Consul services: `sudo rm /etc/consul/conf.d/db-replica*`
- [ ] Reload Consul: `sudo systemctl reload consul`
- [ ] Make sure there are no clients connected to the replica: `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`
- [ ] In `gitlab-com-infrastructure`:
  - `cd environments/gprd`
  - `tf taint "module.patroni-v12.google_compute_instance.instance_with_attached_disk[1]" && tf taint "module.patroni-v12.google_compute_disk.data_disk[1]" && tf taint "module.patroni-v12.google_compute_disk.log_disk[1]"`
  - `tf apply -target=module.patroni-v12`
- [ ] If `td-agent` is refusing to start, run: `sudo /opt/td-agent/bin/gem uninstall google-protobuf -v 3.17.3`
- [ ] Take the replica out of Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4781"; done`
  - We want to control when the replica is added back to Rails load-balancing; otherwise it would be added as soon as it had processed enough WAL segments from GCS.
- [ ] Start Patroni: `sudo systemctl start patroni`
- [ ] Wait until the replica has caught up with the primary: `sudo gitlab-patronictl list | grep $(hostname -I) | grep running`
- [ ] Wait until the replication lag between the replica and the primary has diminished: `while true; do sudo gitlab-patronictl list | grep $(hostname -I) | cut -d'|' -f 7; sleep 180; done`
- [ ] Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- [ ] Expire the two alert silences created at the beginning
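  If the silences were created via `amtool` (an assumption; expiring them from the Alertmanager UI works just as well):

  ```shell
  # List active silences, pick out the two created for this CR by their
  # matchers/comment, then expire them by ID.
  amtool --alertmanager.url=http://localhost:9093 silence query
  amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id-1> <silence-id-2>
  ```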
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Run `while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done`; the number should be increasing gradually
- [ ] Re-enable automatic database reindexing via Slack chatops: `/chatops run feature set database_reindexing true`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
Scenario 1: Before running `tf apply`:
- [ ] Add the replica back to Rails load-balancing: `a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done`
- [ ] Do the verification step(s) above
Scenario 2: After running `tf apply`: there are no viable rollback steps; we have to go forward with the change.
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
Changes checklist
- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.