
[Production] Switchover to new patroni leader and refresh old one

Production Change

Change Summary

After #4721 (closed) we will have refreshed all of our Patroni cluster nodes except the leader. In this CR we'll perform a switchover to a new leader in order to refresh the old one.

Change Details

  1. Services Impacted - ServicePatroni
  2. Change Technician - @ahmadsherif
  3. Change Criticality - C1
  4. Change Type - changeunscheduled
  5. Change Reviewer - @alejandro
  6. Due Date - 2021-07-10 9:00 UTC
  7. Time tracking - ~1 hour
  8. Downtime Component - up to 5 mins

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Disable automatic database reindexing via Slack chatops: /chatops run feature set database_reindexing false
  • Set label changein-progress on this issue
  • Verify that patroni-v12-02 is still the current leader. If not, adjust the procedure below accordingly (a minimal leader-check sketch follows this list)
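
A minimal leader-check sketch, assuming gitlab-patronictl is available on the Patroni node you are logged into; the expected hostname is taken from this CR:

    # Confirm patroni-v12-02 is still the Patroni leader before proceeding.
    EXPECTED_LEADER="patroni-v12-02-db-gprd.c.gitlab-production.internal"
    if sudo gitlab-patronictl list | grep Leader | grep -q "${EXPECTED_LEADER}"; then
      echo "OK: ${EXPECTED_LEADER} is still the leader"
    else
      echo "Leader has changed; adjust the switchover commands below" >&2
    fi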

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Create an alert silence with the following matcher(s):
    • fqdn = patroni-v12-(02|05)-db-gprd.c.gitlab-production.internal (check Regex)
  • Create an alert silence with the following matcher(s):
    • alertname = PostgreSQL_UnusedReplicationSlot
    • slot_name = patroni_v12_02_db_gprd_c_gitlab_production_internal
  • Switchover to a new primary:
    • ssh patroni-v12-02-db-gprd.c.gitlab-production.internal
    • sudo gitlab-patronictl switchover --master patroni-v12-02-db-gprd.c.gitlab-production.internal --candidate patroni-v12-05-db-gprd.c.gitlab-production.internal
  • Make sure a new leader has been elected and all replicas are following it:
    • sudo gitlab-patronictl list | grep Leader | grep running
  • Take the now-replica out of Rails load-balancing (a small helper wrapping these Consul maintenance calls is sketched after this list):
    • a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4781"; done
  • Wait until all clients have been disconnected from the replica (an auto-exiting variant of this check is sketched after this list):
    • while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
    • Wait until the output is zero
  • Disable Chef
    • sudo chef-client-disable "CR #4781"
  • Make sure chef-client is not currently running; if it is, wait until it finishes.
  • Remove pgbouncer Consul services:
    • sudo rm /etc/consul/conf.d/db-replica*
  • Reload Consul:
    • sudo systemctl reload consul
  • Make sure there are no clients connected to the replica:
    • while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
  • In gitlab-com-infrastructure:
    • cd environments/gprd
    • tf taint "module.patroni-v12.google_compute_instance.instance_with_attached_disk[1]" && tf taint "module.patroni-v12.google_compute_disk.data_disk[1]" && tf taint "module.patroni-v12.google_compute_disk.log_disk[1]"
    • tf apply -target=module.patroni-v12
  • If td-agent is refusing to start, run: sudo /opt/td-agent/bin/gem uninstall google-protobuf -v 3.17.3
  • Take the replica out of Rails load-balancing:
    • a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -enable -service=db-replica$i -reason="CR #4781"; done
    • We want to control when to add the replica back to Rails load-balancing; otherwise it would be added automatically once it has processed enough WAL segments from GCS.
  • Start Patroni: sudo systemctl start patroni
  • Wait until the replica has caught up with the primary:
    • sudo gitlab-patronictl list | grep $(hostname -I) | grep running
  • Wait until the replication lag between the replica and the primary has diminished (an auto-exiting lag watch is sketched after this list):
    • while true; do sudo gitlab-patronictl list | grep $(hostname -I) | cut -d'|' -f 7; sleep 180; done
  • Add the replica to Rails
    • a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done
  • Expire the two alert silences created at the beginning
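
The client-drain checks above print the count indefinitely; as a convenience, this hedged variant of the same pipeline exits on its own once no gitlabhq_production clients other than gitlab-monitor remain connected through the local pgbouncer consoles:

    # Poll the pgbouncer consoles until the production client count reaches zero.
    while true; do
      count=$(for c in /usr/local/bin/pgb-console*; do sudo "$c" -c 'SHOW CLIENTS;'; done \
                | grep gitlabhq_production \
                | cut -d '|' -f 2 \
                | awk '{$1=$1};1' \
                | grep -v gitlab-monitor \
                | wc -l)
      echo "remaining clients: ${count}"
      [ "${count}" -eq 0 ] && break
      sleep 5
    done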
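
The Consul maintenance toggle appears three times in the steps above; a small illustrative helper (not part of the original procedure) keeps the enable/disable calls consistent:

    # Toggle Consul maintenance mode for db-replica, db-replica-1 and db-replica-2.
    # Usage: db_replica_maint enable "CR #4781"   or   db_replica_maint disable
    db_replica_maint() {
      local action="$1" reason="${2:-}"
      local suffixes=("" "-1" "-2")
      for s in "${suffixes[@]}"; do
        if [ "${action}" = "enable" ]; then
          consul maint -enable -service="db-replica${s}" -reason="${reason}"
        else
          consul maint -disable -service="db-replica${s}"
        fi
      done
    }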
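
Similarly, the lag watch can be left to exit on its own. The sketch below assumes, as in the command above, that column 7 of gitlab-patronictl list is the lag in MB; the 1 MB threshold is purely illustrative:

    # Watch replication lag for this node and stop once it drops to the threshold.
    THRESHOLD_MB=1
    while true; do
      lag=$(sudo gitlab-patronictl list | grep "$(hostname -I | awk '{print $1}')" | cut -d'|' -f 7 | tr -d ' ')
      echo "current lag: ${lag:-unknown} MB"
      if [ -n "${lag}" ] && [ "${lag}" -le "${THRESHOLD_MB}" ] 2>/dev/null; then
        break
      fi
      sleep 180
    done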

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Run while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done. The number should increase gradually (a variant that exits once clients reconnect is sketched after this list).
  • Re-enable automatic database reindexing via Slack chatops: /chatops run feature set database_reindexing true
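
A hedged variant of the check above that exits once production clients start reconnecting through pgbouncer (the count should then keep climbing):

    # Wait for production clients to reappear on the refreshed replica.
    while true; do
      count=$(for c in /usr/local/bin/pgb-console*; do sudo "$c" -c 'SHOW CLIENTS;'; done \
                | grep gitlabhq_production \
                | cut -d '|' -f 2 \
                | awk '{$1=$1};1' \
                | grep -v gitlab-monitor \
                | wc -l)
      echo "connected clients: ${count}"
      [ "${count}" -gt 0 ] && break
      sleep 5
    done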

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Scenario 1: Before running tf apply:

  • Add the replica to Rails load-balancing:
    • a=("" "-1" "-2"); for i in "${a[@]}"; do consul maint -disable -service=db-replica$i; done
  • Do the verification step(s) above

Scenario 2: After running tf apply:

No viable rollback steps; we have to roll forward with the change.

Monitoring

Key metrics to observe

  • Metric: Metric Name
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Describe Changes

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.