2020-03-30 Database failover and loss of sync to replicas
Summary
The DB master failed over from patroni-01 to patroni-11. The replicas (other than patroni-01) initially failed to connect, reporting errors about the gitlab-replicator account credentials. They then appeared to connect without any action on our part, but reported timeline errors. A manual pg_rewind attempt also failed with errors about the gitlab-replicator credentials. To recover, a temporary account was created, patroni was reconfigured to use it for replication, and patroni was restarted on the replicas a few at a time. The replicas then rewound, caught up, and returned to service.
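The "a few at a time" restart described above can be sketched as a simple batching loop. This is only an illustration: the batch size of 3, the `batches` helper, and the node order are assumptions, not what was actually run during the incident, and the real restart command is deliberately left out.

```python
from typing import Iterator, List


def batches(nodes: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


# Replicas named in the timeline: patroni-02 to patroni-10 and patroni-12.
# patroni-11 is the new master, so it is excluded.
replicas = [f"patroni-{n:02d}" for n in range(2, 13) if n != 11]

# Hypothetical batch size; the incident notes only say "a few at a time".
for group in batches(replicas, 3):
    # In practice each group would be reconfigured and restarted, and the
    # operator would wait for it to catch up before starting the next group.
    print("restart patroni on:", ", ".join(group))
```

Restarting in small groups limits how many replicas are rewinding at once, so that nodes which have already caught up can keep serving read-only traffic.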
Timeline
All times UTC.
2020-03-30
- 04:06 - Failover occurred
- approx 04:15 - Various alerts fired
- 04:37 - Paged IMOC
- 04:39 - Paged OnGres
- 05:32 - patroni-09 reconfigured/restarted with new credentials
- 05:43 - patroni-09 caught up and serving read-only traffic. Reconfiguration/restart of the other nodes started
- 05:53 - patroni-02 caught up
- 06:04 - chef-client stopped on all patroni nodes to prevent it overwriting patroni.yml with incorrect credentials
- 06:10 - patroni-08 caught up
- 06:27 - patroni-03, 04 and 05 caught up
- 06:29 - patroni-06 and 10 caught up
- 06:39 - patroni-07 caught up
- 06:42 - patroni-12 caught up
- 06:44 - chef prevented from any accidental runs on patroni nodes by renaming /etc/chef to /etc/chef.disabled.production.1865
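The final safeguard in the timeline, renaming /etc/chef so that no accidental chef run can pick up its configuration, amounts to a single directory rename. A minimal sketch follows, assuming a hypothetical helper named `disable_config_dir`; it is not the command that was actually used.

```python
import os


def disable_config_dir(path: str, suffix: str) -> str:
    """Rename `path` to `path + suffix` and return the new path.

    Refuses to overwrite an existing target so an earlier disabled
    copy is never clobbered.
    """
    disabled = path + suffix
    if os.path.exists(disabled):
        raise FileExistsError(disabled)
    os.rename(path, disabled)
    return disabled


# Example call matching the timeline entry (would need root privileges):
# disable_config_dir("/etc/chef", ".disabled.production.1865")
```

Putting the reason in the suffix (here the environment and what looks like an issue number) leaves a hint for whoever later finds the directory disabled; re-enabling chef is the reverse rename.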
~S1 ServicePostgres
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by Craig Miskell