2020-03-30 Database failover and loss of sync to replicas

Summary

The DB master failed over from patroni-01 to patroni-11. The replicas (other than patroni-01) initially failed to connect, with errors about the gitlab-replicator account credentials. They then appeared to connect without any active changes on our part, but reported timeline errors. Attempting a manual pg_rewind also failed with errors about the gitlab-replicator credentials. A temporary account was created, patroni was reconfigured to use it for replication, and patroni was restarted on the replicas (a few at a time). The replicas then rewound, caught up, and returned to service.
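The "a few at a time" restart pattern can be sketched as follows. This is a minimal illustration only, not the tooling used during the incident: the function names, the default batch size, and the `restart` and `caught_up` callables are all hypothetical stand-ins for whatever actually restarts patroni on a node and checks that the node has caught up.

```python
from typing import Callable, Iterator, List, Sequence


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def rolling_restart(
    replicas: Sequence[str],
    restart: Callable[[str], None],    # hypothetical: restarts patroni on one node
    caught_up: Callable[[str], bool],  # hypothetical: True once the node has caught up
    batch_size: int = 3,               # assumed value; the report only says "a few"
) -> List[List[str]]:
    """Restart replicas one batch at a time.

    Stops with an error as soon as any node in a batch has not caught up,
    so the remaining replicas are left untouched.
    """
    done: List[List[str]] = []
    for group in batches(replicas, batch_size):
        for node in group:
            restart(node)
        # A real check would poll with a timeout; this sketch checks once.
        if not all(caught_up(node) for node in group):
            raise RuntimeError(f"batch did not catch up: {group}")
        done.append(group)
    return done
```

Keeping the batches small limits how many replicas are out of service at once, and stopping on the first batch that fails to catch up avoids repeating a bad configuration across every node.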

Timeline

All times UTC.

2020-03-30

04:06 - Failover occurred
approx. 04:15 - Various alerts
04:37 - Paged IMOC
04:39 - Paged OnGres
05:32 - patroni-09 reconfigured/restarted with new credentials
05:43 - patroni-09 caught up and serving read-only traffic. Reconfiguration/restart of the other nodes started
05:53 - patroni-02 caught up
06:04 - chef-client stopped on all patroni nodes to prevent it overwriting patroni.yml with incorrect credentials
06:10 - patroni-08 caught up
06:27 - patroni-03, 04 and 05 caught up
06:29 - patroni-06 and 10 caught up
06:39 - patroni-07 caught up
06:42 - patroni-12 caught up
06:44 - chef prevented from any accidental runs on patroni nodes by renaming /etc/chef to /etc/chef.disabled.production.1865

~S1 ServicePostgres

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Edited Mar 30, 2020 by Craig Miskell