Geo1 PostgreSQL master-master split-brain
We discovered today that `postgres-01` and `postgres-02` both believed they were the master for geo1:
```
root@postgres-01.db.geo1.gitlab.com:~# gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+--------------------------------|--------------------------------|--------------------------------------------------------------------------------------
* master | postgres-01.db.geo1.gitlab.com | | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```
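A check like the following could flag this condition automatically. It is only a minimal sketch: it assumes the pipe-separated column layout shown in the output above (which may differ between repmgr versions), and the names `find_masters` and `is_split_brain` are hypothetical helpers, not part of repmgr or `gitlab-ctl`.

```python
from typing import List

# Sample taken from the `gitlab-ctl repmgr cluster show` output in this issue.
SAMPLE = """\
Role | Name | Upstream | Connection String
----------+--------------------------------|--------------------------------|------
* master | postgres-01.db.geo1.gitlab.com | | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
"""


def find_masters(cluster_show_output: str) -> List[str]:
    """Return the names of all nodes whose Role column reads 'master'."""
    masters = []
    for line in cluster_show_output.splitlines():
        # Assumed layout: Role | Name | Upstream | Connection String
        columns = [col.strip() for col in line.split("|")]
        if len(columns) < 2:
            continue
        # The role may carry a leading "*" marker, e.g. "* master".
        role = columns[0].lstrip("*").strip()
        if role == "master":
            masters.append(columns[1])
    return masters


def is_split_brain(cluster_show_output: str) -> bool:
    """True when more than one node reports itself as master."""
    return len(find_masters(cluster_show_output)) > 1


if __name__ == "__main__":
    if is_split_brain(SAMPLE):
        print("split-brain: multiple masters:", ", ".join(find_masters(SAMPLE)))
```

In practice the input would come from running `gitlab-ctl repmgr cluster show` on a database node rather than from a hard-coded sample.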
It appears to have happened after a restart of the postgres process on `postgres-01` at around 11:21 on 2018-01-08:
```
2018-01-08_11:21:17.96638 postgres-01 postgresql: received TERM from runit, sending INT instead to force quit connections
2018-01-08_11:21:17.97098 postgres-01 postgresql: LOG: received fast shutdown request
2018-01-08_11:21:17.97108 postgres-01 postgresql: LOG: aborting any active transactions
2018-01-08_11:21:17.97115 postgres-01 postgresql: FATAL: terminating connection due to administrator command
```
From the `postgres-02` repmgr logs:
```
2018-01-08_11:21:19.49215 [2018-01-08 11:21:19] [ERROR] connection to database failed: FATAL: the database system is shutting down
2018-01-08_11:21:19.49225 FATAL: the database system is shutting down
2018-01-08_11:21:19.49232
2018-01-08_11:21:19.49466 [2018-01-08 11:21:19] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
2018-01-08_11:21:29.49604 [2018-01-08 11:21:29] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
2018-01-08_11:21:39.49740 [2018-01-08 11:21:39] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
2018-01-08_11:21:49.49875 [2018-01-08 11:21:49] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
2018-01-08_11:21:59.49973 [2018-01-08 11:21:59] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
2018-01-08_11:22:09.50079 [2018-01-08 11:22:09] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
2018-01-08_11:22:19.50221 [2018-01-08 11:22:19] [ERROR] unable to reconnect to master (timeout 60 seconds)...
2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best candidate to be the new master, promoting...
```
I'm not entirely sure what caused the restart of the postgres process on `postgres-01`; it may have been us. We need to investigate why the automatic failover failed, as this is very similar to what happened in production in gitlab-com/infrastructure#3512. We may also want to consider turning off automatic failover.
In the meantime, I told `postgres-02` to follow `postgres-01` again and things are back to the way they should be.
cc/ @jarv