Re-sync patroni-07
Production Change
Change Summary
patroni-07 needs to be re-synced with the primary because of this incident. As the necessary WAL files are already missing on the primary, the fastest way is to sync from the GCS archive bucket using wal-e.
patroni-07 is currently out of rotation (`noloadbalance`, `nofailover`, and chef-client is disabled).
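Before executing the change, it may be worth confirming that patroni-07 can actually reach the GCS archive bucket via wal-e. A minimal sketch, reusing the envdir and wal-e paths from the `restore_command` in the change steps below; whether `backup-list` is the most convenient check here is an assumption:

```shell
# Sketch: confirm patroni-07 can reach the GCS archive bucket with wal-e before starting.
# The envdir and wal-e paths are taken from the restore_command used later in this plan.
sudo -u gitlab-psql /usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e backup-list
```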
Change Details
- Services Impacted - Service::Patroni
- Change Technician - @alejandro
- Change Criticality - C1
- Change Type - changeunscheduled
- Change Reviewer - @msmiley
- Due Date - 2020-08-25 21:00 UTC
- Time tracking - 1h-3h, depending on how many WAL files need to be replayed
- Downtime Component - no downtime expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1m
- check that chef-client is disabled (`chef-client-is-enabled`)
- check that the patroni `noloadbalance` and `nofailover` tags are enabled in `/var/opt/gitlab/patroni/patroni.yml`
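A minimal sketch of how these two checks could be run on patroni-07; `chef-client-is-enabled` is the site-specific helper referenced above, and its exit-code semantics as well as the YAML layout of the tags are assumptions:

```shell
# Sketch: pre-change checks on patroni-07.
# chef-client-is-enabled is the site-specific helper named above; its exit code semantics
# (non-zero when chef-client is disabled) are an assumption.
if chef-client-is-enabled; then echo "chef-client is still enabled - stop here"; else echo "chef-client is disabled"; fi

# Confirm the noloadbalance/nofailover tags are set; the exact YAML layout is an assumption.
grep -A 5 'tags:' /var/opt/gitlab/patroni/patroni.yml
```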
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60m
- `systemctl stop patroni`
- make sure postgres was stopped by patroni: `sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ status`
  - if not: `sudo -u gitlab-psql /usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop`
- add this line to recovery.conf: `restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "-p 32" "%f" "%p"'`
- `/usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ -o "--config-file=/var/opt/gitlab/postgresql/postgresql.conf" start`
- wait for replication to catch up using wal-e - this might take a while; check the postgres.csv logs (see the lag-check sketch after this list)
- re-create the replication slot on the primary, to let it accumulate WAL files for patroni-07: `SELECT pg_create_physical_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');`
  - this slot shouldn't stay unused for too long, so the disk on the primary doesn't fill up
- `/usr/lib/postgresql/11/bin/pg_ctl -D /var/opt/gitlab/postgresql/data11/ stop`
- `systemctl start patroni`
- check that the node is added back to the cluster: `gitlab-patronictl list`
  - wait a minute if patroni-07 isn't showing up in the list immediately
- check that the replication lag is low and everything looks OK in the postgres.csv log and on the dashboards
- remove the `noloadbalance` and `nofailover` tags
- `systemctl reload patroni`
- check that connections go up: `for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l`
- `chef-client-enable`
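While waiting for WAL replay to catch up, recovery progress can be checked directly on patroni-07. A minimal sketch, assuming psql from the same PostgreSQL 11 installation and a local socket connection; adjust connection details as needed:

```shell
# Sketch: check how far WAL replay on patroni-07 is behind while wal-e fetches WAL.
# Assumes psql from the same PostgreSQL 11 installation and a local socket connection;
# host/port/database may need adjusting for this environment.
sudo -u gitlab-psql /usr/lib/postgresql/11/bin/psql -d postgres -c \
  "SELECT pg_last_wal_replay_lsn() AS replay_lsn,
          now() - pg_last_xact_replay_timestamp() AS replay_delay;"
```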
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5m
- check the replication lag and the patroni dashboard (see the sketch below)
- remove the silence for the replication lag alert: https://alerts.gitlab.net/#/silences/bc7d7e91-8136-46ef-9e08-d573833ed700
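To verify the node is streaming again, `pg_stat_replication` on the primary can be consulted alongside the dashboards. A sketch to be run on the primary, with the same connection assumptions as above:

```shell
# Sketch: on the primary, confirm patroni-07 is streaming again and its lag is shrinking.
# The psql invocation mirrors the commands above; connection details are assumptions.
sudo -u gitlab-psql /usr/lib/postgresql/11/bin/psql -d postgres -c \
  "SELECT application_name, state, sent_lsn, replay_lsn, replay_lag FROM pg_stat_replication;"
```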
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1m
- set the `noloadbalance` and `nofailover` tags to bring the node out of rotation, then `systemctl reload patroni`
- remove the replication slot again, to prevent accumulating WAL files on the primary (see the sketch below)
  - on primary: `select pg_drop_replication_slot('patroni_07_db_gprd_c_gitlab_production_internal');`
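After dropping the slot, a quick check on the primary can confirm that no inactive slot is left retaining WAL. A sketch, with the same connection assumptions as above:

```shell
# Sketch: on the primary, confirm the patroni-07 slot is gone and no inactive slots
# are left retaining WAL. Connection details are assumptions, as above.
sudo -u gitlab-psql /usr/lib/postgresql/11/bin/psql -d postgres -c \
  "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
```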
Monitoring
Key metrics to observe
- Metric: replication lag
- Location: Thanos
- What changes to this metric should prompt a rollback: the replication lag fails to fall and instead continues to rise.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue.)
- There are currently no active incidents.