
Re-sync patroni-07

Production Change

Change Summary

patroni-07 needs to be re-synced with the primary because of this incident. As the necessary WAL files are already missing on the primary, the fastest way is to sync from the GCS archive bucket using wal-e.

patroni-07 is currently out of rotation (the noloadbalance and nofailover tags are set, and chef-client is disabled).


Change Details

  1. Services Impacted - Service::Patroni
  2. Change Technician - @alejandro
  3. Change Criticality - C1
  4. Change Type - changeunscheduled
  5. Change Reviewer - @msmiley
  6. Due Date - 2020-08-25 21:00 UTC
  7. Time tracking - 1h to 3h, depending on how many WAL files need to be replayed
  8. Downtime Component - no downtime expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1m

  • check that chef-client is disabled (chef-client-is-enabled)
  • check that the patroni noloadbalance and nofailover tags are set in /var/opt/gitlab/patroni/patroni.yml (a sketch of both checks follows below)
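
A minimal sketch of these two checks (the exact chef-client-is-enabled output and the tags layout in patroni.yml are assumptions - verify on the node):

  • sudo chef-client-is-enabled   # should report that chef-client is disabled
  • sudo grep -E 'noloadbalance|nofailover' /var/opt/gitlab/patroni/patroni.yml   # expect both tags to be present and set to true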

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60m

  • systemctl stop patroni
  • make sure postgres was stopped by patroni
    • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ status
    • if not: sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop
  • add this line to recovery.conf:
    restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'
  • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ -o "--config-file=/var/opt/gitlab/postgresql/postgresql.conf" start
  • wait for replication to catch up using wal-e - this can take a while; check the postgres.csv logs (see the verification sketch after this list)
  • re-create the replication slot on the primary, to let it accumulate WAL files for patroni-07
    • SELECT pg_create_physical_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');
    • this slot shouldn't stay unused for too long, otherwise WAL accumulates and the disk on the primary can fill up
  • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop
  • systemctl start patroni
  • check that patroni-07 has been added back to the cluster
    • gitlab-patronictl list
    • wait a minute if patroni-07 isn't showing up immediately in the list
  • check that replication lag is low and everything looks healthy in the postgres.csv log and on the dashboards
  • remove the noloadbalance and nofailover tags from /var/opt/gitlab/patroni/patroni.yml
  • systemctl reload patroni
  • check that client connections are coming back up:
    • for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l
  • chef-client-enable
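
A sketch for the catch-up and replication lag checks referenced above (the psql invocation mirrors the pg_ctl commands in this plan; socket path, database name, and exact view columns are assumptions to adjust locally):

  • on patroni-07, while WAL is being replayed from the archive:
    • sudo -u gitlab-psql /usr/lib/postgresql/11/bin/psql -d postgres -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
  • on the primary, once patroni-07 is attached and streaming again:
    • SELECT application_name, state, replay_lag FROM pg_stat_replication;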

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5m

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1m

  • set the noloadbalance and nofailover tags to bring the node out of rotation again, then systemctl reload patroni (see the tags sketch below)
  • remove the replication slot again, to prevent WAL files from accumulating on the primary
    • on primary: select pg_drop_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');
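
For the tag changes above and in the Change Steps, a sketch of how the tags stanza in /var/opt/gitlab/patroni/patroni.yml typically looks (the exact key layout is an assumption - mirror what is already in the file):

    tags:
      nofailover: true
      noloadbalance: true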

Monitoring

Key metrics to observe

  • Metric: replication lag
    • Location: Thanos
    • What changes to this metric should prompt a rollback: the replication lag fails to fall and instead continues to rise.
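
A hedged example of a Thanos query for this metric (the pg_replication_lag metric name and the fqdn label are assumptions based on the usual postgres_exporter setup - use whatever the existing dashboards query):

  • pg_replication_lag{fqdn=~"patroni-07.*"}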

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.