Provide a mechanism to recover a previously failed database master
Omnibus HA currently has nice commands to register PG nodes and bring up an HA database cluster. However, it doesn't provide good ways to recover a failed node and bring it back into the PG/Repmgr cluster.
In my own testing, I could successfully re-register a failed node with the following steps:
gitlab-ctl repmgr standby setup db1.example.com -w
su - gitlab-psql
/opt/gitlab/embedded/bin/repmgr -f /var/opt/gitlab/postgresql/repmgr.conf standby register --force
exit
But this doesn't seem to work in all cases. I was just on a call with a customer where we tried the same steps but kept getting the error "node XXXXX is already registered". We ended up having to delete the node from the repmgr database directly:
delete from repmgr_gitlab_cluster.repl_nodes where name = 'foo';
After that, the standby register command above worked.
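To make the workaround easier to review, here is a rough sketch that only prints the steps described above in order; it does not run anything against the cluster. The function names (sql_quote, print_recovery_plan) and their arguments are hypothetical and are not part of gitlab-ctl. The host argument is whatever was passed to standby setup above (db1.example.com), and the node name is the value used in the delete statement ('foo').

```shell
#!/bin/sh
# Sketch only: prints the manual recovery steps from this issue for review.
# Nothing here is executed against the cluster.

# Paths copied from the steps in this issue.
REPMGR_BIN=/opt/gitlab/embedded/bin/repmgr
REPMGR_CONF=/var/opt/gitlab/postgresql/repmgr.conf

# sql_quote VALUE (hypothetical helper)
# Wraps VALUE in single quotes and doubles any embedded single quotes.
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# print_recovery_plan SETUP_HOST STALE_NODE_NAME (hypothetical helper)
# SETUP_HOST is the host passed to "standby setup";
# STALE_NODE_NAME is the name column of the stale repl_nodes row.
print_recovery_plan() {
    setup_host=$1
    stale_node=$2
    if [ -z "$setup_host" ] || [ -z "$stale_node" ]; then
        echo "usage: print_recovery_plan SETUP_HOST STALE_NODE_NAME" >&2
        return 1
    fi
    echo "# 1. On the failed node:"
    echo "gitlab-ctl repmgr standby setup $setup_host -w"
    echo "# 2. On the failed node, as the gitlab-psql user:"
    echo "$REPMGR_BIN -f $REPMGR_CONF standby register --force"
    echo "# 3. Only if step 2 reports 'already registered': run this in the repmgr database, then repeat step 2"
    echo "delete from repmgr_gitlab_cluster.repl_nodes where name = $(sql_quote "$stale_node");"
}
```

Calling print_recovery_plan db1.example.com foo prints the three steps with the values filled in, so they can be pasted into a runbook or checked before anyone touches the node.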
There may be 2 parts to this:
- Better gitlab-ctl support for re-registering a failed node
- Documentation on how to recover a failed node