
Re-sync patroni-07

Production Change

Change Summary

patroni-07 needs to be re-synced with the primary because of this incident. As the necessary WAL files are already missing on the primary, the fastest way is to sync from the GCS archive bucket using wal-e.

patroni-07 is currently out of rotation (the noloadbalance and nofailover tags are set, and chef-client is disabled).


Change Details

  1. Services Impacted - Service::Patroni
  2. Change Technician - @alejandro
  3. Change Criticality - C1
  4. Change Type - changeunscheduled
  5. Change Reviewer - @msmiley
  6. Due Date - 2020-08-25 21:00 UTC
  7. Time tracking - 1h to 3h, depending on how many WAL files need to be replayed
  8. Downtime Component - no downtime expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1m

  • check that chef-client is disabled (chef-client-is-enabled)
  • check that the patroni noloadbalance and nofailover tags are set in /var/opt/gitlab/patroni/patroni.yml (a sketch of both checks follows below)
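
A minimal sketch of these two checks (the exact chef-client-is-enabled output and the tags layout in patroni.yml are assumptions - verify on the node):

  • sudo chef-client-is-enabled   # should report that chef-client is disabled
  • sudo grep -E 'noloadbalance|nofailover' /var/opt/gitlab/patroni/patroni.yml   # expect both tags to be present and set to true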

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60m

  • systemctl stop patroni
  • make sure postgres was stopped by patroni
    • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ status
    • if not: sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop
  • add this line to recovery.conf:
    restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'
  • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ -o "--config-file=/var/opt/gitlab/postgresql/postgresql.conf" start
  • wait for replication to catch up using wal-e - this can take a while; check the postgres.csv logs (see the verification sketch after this list)
  • re-create the replication slot on the primary, to let it accumulate WAL files for patroni-07
    • SELECT pg_create_physical_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');
    • this slot shouldn't stay unused for too long, otherwise WAL accumulates and the disk on the primary can fill up
  • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop
  • systemctl start patroni
  • check that patroni-07 has been added back to the cluster
    • gitlab-patronictl list
    • wait a minute if patroni-07 isn't showing up immediately in the list
  • check that replication lag is low and everything looks healthy in the postgres.csv log and on the dashboards
  • remove the noloadbalance and nofailover tags from /var/opt/gitlab/patroni/patroni.yml
  • systemctl reload patroni
  • check that client connections are coming back up:
    • for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l
  • chef-client-enable
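
A sketch for the catch-up and replication lag checks referenced above (the psql invocation mirrors the pg_ctl commands in this plan; socket path, database name, and exact view columns are assumptions to adjust locally):

  • on patroni-07, while WAL is being replayed from the archive:
    • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/psql -d postgres -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
  • on the primary, once patroni-07 is attached and streaming again:
    • SELECT application_name, state, replay_lag FROM pg_stat_replication;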

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5m

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1m

  • set the noloadbalance and nofailover tags to bring the node out of rotation again, then systemctl reload patroni (see the tags sketch below)
  • remove the replication slot again, to prevent WAL files from accumulating on the primary
    • on primary: select pg_drop_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');
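
For the tag changes above and in the Change Steps, a sketch of how the tags stanza in /var/opt/gitlab/patroni/patroni.yml typically looks (the exact key layout is an assumption - mirror what is already in the file):

    tags:
      nofailover: true
      noloadbalance: true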

Monitoring

Key metrics to observe

  • Metric: replication lag
    • Location: Thanos
    • What changes to this metric should prompt a rollback: the replication lag fails to fall and instead continues to rise.
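
A hedged example of a Thanos query for this metric (the pg_replication_lag metric name and the fqdn label are assumptions based on the usual postgres_exporter setup - use whatever the existing dashboards query):

  • pg_replication_lag{fqdn=~"patroni-07.*"}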

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.