Geo1 PostgreSQL master-master split-brain

We discovered today that postgres-01 and postgres-02 both believed they were the master for geo1.

root@postgres-01.db.geo1.gitlab.com:~# gitlab-ctl repmgr cluster show
Role      | Name                           | Upstream                       | Connection String
----------+--------------------------------|--------------------------------|--------------------------------------------------------------------------------------
* master  | postgres-01.db.geo1.gitlab.com |                                | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master  | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
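
A check like this could be automated. The following is only a minimal sketch, assuming the output format stays exactly as pasted above (pipe-separated columns, role in the first column, node name in the second). The function names `find_masters` and `is_split_brain` are hypothetical and not part of repmgr or gitlab-ctl.

```python
# Sketch: flag a split-brain from `gitlab-ctl repmgr cluster show` output.
# Assumes the column layout shown in this issue; names here are illustrative.

SAMPLE = """\
Role      | Name                           | Upstream                       | Connection String
----------+--------------------------------|--------------------------------|--------------------------------------------------------------------------------------
* master  | postgres-01.db.geo1.gitlab.com |                                | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master  | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
"""


def find_masters(cluster_show_output):
    """Return the names of all nodes whose role column reads 'master'."""
    masters = []
    for line in cluster_show_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4:
            # Separator line or anything that is not a full data row.
            continue
        # Drop the leading '*' marker before comparing the role.
        role = cols[0].lstrip("*").strip()
        if role == "master":
            masters.append(cols[1])
    return masters


def is_split_brain(cluster_show_output):
    """More than one master in the same cluster is treated as split-brain."""
    return len(find_masters(cluster_show_output)) > 1


if __name__ == "__main__":
    print(find_masters(SAMPLE))
    print(is_split_brain(SAMPLE))
```

Run against the output above, this reports both postgres-01 and postgres-02 as masters, which is the condition we would want to alert on.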

It appears to have happened after a restart of the postgres process on postgres-01 at around 11:21 on 2018-01-08:

2018-01-08_11:21:17.96638 postgres-01 postgresql: received TERM from runit, sending INT instead to force quit connections
2018-01-08_11:21:17.97098 postgres-01 postgresql: LOG:  received fast shutdown request
2018-01-08_11:21:17.97108 postgres-01 postgresql: LOG:  aborting any active transactions
2018-01-08_11:21:17.97115 postgres-01 postgresql: FATAL:  terminating connection due to administrator command

From the postgres-02 repmgr logs:

2018-01-08_11:21:19.49215 [2018-01-08 11:21:19] [ERROR] connection to database failed: FATAL:  the database system is shutting down
2018-01-08_11:21:19.49225 FATAL:  the database system is shutting down
2018-01-08_11:21:19.49232
2018-01-08_11:21:19.49466 [2018-01-08 11:21:19] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
2018-01-08_11:21:29.49604 [2018-01-08 11:21:29] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
2018-01-08_11:21:39.49740 [2018-01-08 11:21:39] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
2018-01-08_11:21:49.49875 [2018-01-08 11:21:49] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
2018-01-08_11:21:59.49973 [2018-01-08 11:21:59] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
2018-01-08_11:22:09.50079 [2018-01-08 11:22:09] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
2018-01-08_11:22:19.50221 [2018-01-08 11:22:19] [ERROR] unable to reconnect to master (timeout 60 seconds)...
2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best candidate to be the new master, promoting...
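
To put a number on the window, here is a small sketch that computes the time between the fast shutdown request on postgres-01 and the promotion notice on postgres-02, using the two timestamps from the logs in this issue. It assumes the clocks on both hosts are in sync, and the helper name `parse_log_ts` is hypothetical.

```python
# Sketch: elapsed time between two log lines quoted in this issue.
# Assumes both hosts log in the same timezone with synchronized clocks.
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d_%H:%M:%S.%f"


def parse_log_ts(line):
    """Parse the leading timestamp of a log line in the format used above."""
    return datetime.strptime(line.split(" ", 1)[0], LOG_TS_FORMAT)


shutdown = parse_log_ts(
    "2018-01-08_11:21:17.97098 postgres-01 postgresql: LOG:  received fast shutdown request"
)
promotion = parse_log_ts(
    "2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best "
    "candidate to be the new master, promoting..."
)

elapsed = (promotion - shutdown).total_seconds()

if __name__ == "__main__":
    print(f"{elapsed:.2f} seconds from shutdown request to promotion")
```

That works out to roughly 66.6 seconds, so postgres-02 promoted itself a little over a minute after postgres-01 began shutting down.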

I'm not entirely sure what caused the restart of the postgres process on postgres-01; it may have been us. We need to investigate why the automatic failover failed, as this is very similar to what happened in production in #3512 (moved). We may also want to consider turning off automatic failover.

In the meantime, I told postgres-02 to follow postgres-01 again and things are back to the way they should be.

cc/ @jarv

Edited by Alex Hanselka