# Patroni replica restart and primary switchover
Production Change - Criticality 2

| Change Objective | Patroni replica restart and primary switchover |
|---|---|
| Change Type | Operation |
| Services Impacted | ~"Service::Patroni" |
| Change Team Members | @hphilipps, @Finotto, OnGres for monitoring |
| Change Criticality | C2 |
| Change Reviewer | @Finotto, @NikolayS, OnGres, Datastores team |
| Tested in staging | Yes |
| Dry-run output | Dry run by using `ansible-playbook --check` |
| Due Date | 2020-06-13 22:00 UTC |
| Time tracking | Less than 1h, depending on how long we wait for connections to drop etc.; the switchover itself will take less than a minute. |
## Summary
To be able to decommission the nodes patroni-09..12, we want to switch over the primary DB from patroni-11 to patroni-01.

As this will necessarily impose a short interruption of DB connections, we want to use this opportunity to also change three Postgres settings that require a Postgres restart (and have already been applied in staging), so that both changes happen within a single short downtime:

- `max_connections`: 300 -> 500
- `autovacuum_max_workers`: 6 -> 10
- `max_wal_size`: 5GB -> 8GB
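For illustration, this is how the three parameter changes could be applied centrally via the Patroni DCS with `patronictl edit-config`. It is only a sketch (the real change is delivered through the Chef MR and the playbook's `run_edit_config=true` run); the cluster name `pg11-ha-cluster` is taken from the rollback steps below:

```bash
# Sketch: set the three PostgreSQL parameters in the DCS cluster config.
# Patroni distributes these to all members; they take effect after restart.
gitlab-patronictl edit-config pg11-ha-cluster --force \
  -p max_connections=500 \
  -p autovacuum_max_workers=10 \
  -p max_wal_size=8GB
```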
We will first do a rolling restart of all replicas with the new settings, taking them out of rotation one by one so as not to interrupt client connections. This will take around 20 minutes and cause no downtime.

Then we will execute a primary switchover using Patroni. This will cause a short interruption of DB connections (probably below 10s). We will execute the steps during low-traffic times, mostly automated through Ansible, to keep the downtime and risk as low as possible.
The procedure has been tested in staging multiple times.
## Detailed steps for the change
### Pre-Conditions
- patroni-11 is primary and patroni-01..08 are up and in sync
- Chef MR with config changes is approved
- Ansible environment prepared on console node (a quick sanity check is sketched below)
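A quick way to sanity-check the prepared Ansible environment from the console node is an ad-hoc ping against the inventory used in the execution steps (a sketch, not part of the tested procedure):

```bash
# Verify all hosts in the inventory are reachable through Ansible.
ansible -i inventory/gprd.yml all -m ping -o
```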
### Execution
- Inform EOC
- Set silence for patroni nodes: gprd and ops
- Merge MR
- Execute chef-client on patroni nodes: `knife ssh 'role:gprd-base-db-patroni' sudo chef-client`
- Open tmux session in ansible workdir on console node (setup instructions)
- Check inventory file
- Execute restart-replicas playbook: `ansible-playbook -i inventory/gprd.yml -e "run_edit_config=true" playbooks/restart_postgres.yml`
  - This will execute the following steps for each replica, one by one:
    - Apply central `max_connections=500` setting via DCS
    - Stop chef-client service
    - Take node out of loadbalancing and failover
    - Wait for connections to drain (timeout 4m, allowed to fail)
    - Terminate still-existing connections (see the sketch after this list)
    - Execute checkpoint
    - Restart Postgres
    - Wait for Postgres to become ready (timeout 5m)
    - Check for replication lag to be below 2s
    - Enable loadbalancing and failover for this node again
    - Check for traffic ramping up
    - Start chef-client service
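The "terminate still-existing connections" and "execute checkpoint" steps boil down to something like the following on the replica being restarted; a minimal sketch, not necessarily the playbook's exact queries:

```bash
# Kill remaining client backends (everything except our own session), then
# checkpoint so the shutdown checkpoint during the restart is fast.
gitlab-psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE backend_type = 'client backend' AND pid <> pg_backend_pid();"
gitlab-psql -c "CHECKPOINT;"
```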
- Check status and replication lag for all replicas: `gitlab-patronictl list`
- Check for latencies and error rates in the Platform Triage Dashboard
- Check if settings have been applied: `gitlab-psql -c 'SELECT name,setting FROM pg_settings;'`
- Execute switchover playbook: `ansible-playbook -i inventory/gprd.yml -e "candidate=patroni-01-db-gprd.c.gitlab-production.internal" playbooks/switchover_primary.yml`
  - This will execute the following:
    - Execute checkpoint on primary and replicas
    - Wait for replication lag on switchover candidate to be below 2s
    - Set very low `checkpoint_completion_target` on candidate, to prevent replicas having to wait for a spread checkpoint after switchover (see the sketch after this list)
    - Switch over the primary
    - Wait for new primary to become ready
    - Check if old primary is ready as replica
    - Reset `checkpoint_completion_target` on candidate to old value
    - Wait for replication lag on old primary to be below 2s
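The `checkpoint_completion_target` adjustment around the switchover could look like this on the candidate; a sketch assuming it is set at the Postgres level (the playbook may instead go through Patroni/DCS):

```bash
# Before the switchover: make checkpoints complete as fast as possible.
gitlab-psql -c "ALTER SYSTEM SET checkpoint_completion_target = 0;"
gitlab-psql -c "SELECT pg_reload_conf();"

# After the switchover: drop the override to fall back to the configured value.
gitlab-psql -c "ALTER SYSTEM RESET checkpoint_completion_target;"
gitlab-psql -c "SELECT pg_reload_conf();"
```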
- Check status and replication lag for all replicas: `gitlab-patronictl list`
- Check for latencies and error rates in the Platform Triage Dashboard
- Check replicas in consul: `dig @localhost -p 8600 +short db-replica.service.consul. SRV | awk '{ print $4 }'`
- Take patroni-11 out of rotation for decommissioning later:
  - `systemctl stop chef-client`
  - Take node out of loadbalancing and failover, reload patroni
  - Wait for all connections to drain from patroni-11 (can take a few minutes; a polling sketch follows this list): `for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l`
  - Check cluster status
  - `systemctl stop patroni`
  - Check if patroni-11 is gone from the cluster: `gitlab-patronictl list`
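Instead of re-running the counting command by hand, the drain wait can be wrapped in a small polling loop (same counting pipeline as in the step above):

```bash
# Poll the pgbouncer consoles until no gitlabhq_production clients remain.
while true; do
  n=$(for c in /usr/local/bin/pgb-console*; do
        $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor
      done | wc -l)
  echo "$(date -u +%H:%M:%S) remaining client connections: $n"
  [ "$n" -eq 0 ] && break
  sleep 10
done
```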
### Post Execution
- Check replication lag of delayed and archive replica (a lag query sketch follows this list)
- Make sure `VACUUM ANALYZE` was executed on the new primary (automatically done on role change by patroni)
- Make sure the postgres-checkup project is including the new primary: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10343
- Remove silences, but leave a silence for patroni-09..12
- Decommission patroni-09..12
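One way to check the apply lag on the delayed and archive replicas is to compare the last replayed transaction timestamp to the current time; a sketch to run on the replica itself (the delayed replica is expected to lag by design):

```bash
# Shows how far behind WAL replay is on this replica.
gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"
```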
## Rollback steps
- For rolling back the config changes:
  - Revert the MR
  - Run chef-client
  - Restart Postgres on the primary (it needs to have `max_connections` lowered before the replicas): `gitlab-patronictl restart --force -r master pg11-ha-cluster`
  - Adjust the Ansible inventory to the current cluster topology (if we switched the primary)
  - Execute the restart-postgres playbook again as above
- For rolling back to the previous primary:
  - Adjust the Ansible inventory to the current cluster topology, then run: `ansible-playbook -i inventory/gprd.yml -e "candidate=patroni-11-db-gprd.c.gitlab-production.internal" playbooks/switchover_primary.yml`
  - Or, in an emergency without Ansible: `gitlab-patronictl switchover --master patroni-01-db-gprd.c.gitlab-production.internal --candidate patroni-11-db-gprd.c.gitlab-production.internal`
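Whichever rollback path is taken, it is worth confirming the resulting topology and that the reverted settings are live; a small verification sketch reusing commands referenced above:

```bash
# Cluster topology and replica health after the rollback.
gitlab-patronictl list

# Confirm the three settings are back at their previous values.
gitlab-psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('max_connections', 'autovacuum_max_workers', 'max_wal_size');"
```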
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- SRE on-call has been informed prior to the change being rolled out
- There are currently no open issues labeled as ~"Service::Monitoring" with severities of ~S1 or ~S2