Geo1 PostgreSQL master-master split-brain
We discovered today that postgres-01 and postgres-02 each believed it was the master for geo1:
root@postgres-01.db.geo1.gitlab.com:~# gitlab-ctl repmgr cluster show
Role | Name | Upstream | Connection String
----------+--------------------------------|--------------------------------|--------------------------------------------------------------------------------------
* master | postgres-01.db.geo1.gitlab.com | | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
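A check for this condition could be scripted against the output above. The following is only a minimal sketch: it assumes the table layout shown in this issue (pipe-separated columns, role in the first column, node name in the second), and the helper names `find_masters` and `is_split_brain` are hypothetical, not part of repmgr or gitlab-ctl.

```python
from typing import List

# Sample copied from the `gitlab-ctl repmgr cluster show` output in this issue.
SAMPLE = """\
Role | Name | Upstream | Connection String
----------+--------------------------------|--------------------------------|-----------
* master | postgres-01.db.geo1.gitlab.com | | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
"""


def find_masters(cluster_show_output: str) -> List[str]:
    """Return the names of all nodes whose Role column reads 'master'."""
    masters = []
    for line in cluster_show_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 2:
            continue  # blank or malformed line
        role, name = cells[0], cells[1]
        # The role cell may carry a leading '*' marker, e.g. '* master'.
        if role.lstrip("* ").lower() == "master":
            masters.append(name)
    return masters


def is_split_brain(cluster_show_output: str) -> bool:
    """True when more than one node reports itself as master."""
    return len(find_masters(cluster_show_output)) > 1


if __name__ == "__main__":
    print(find_masters(SAMPLE))
    print(is_split_brain(SAMPLE))
```

Run against the output pasted above, this reports both postgres-01 and postgres-02 as masters, which is the split-brain state described here.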
It appears to have happened after a restart of the postgres process on postgres-01 at around 11:21 on 2018-01-08:
2018-01-08_11:21:17.96638 postgres-01 postgresql: received TERM from runit, sending INT instead to force quit connections
2018-01-08_11:21:17.97098 postgres-01 postgresql: LOG: received fast shutdown request
2018-01-08_11:21:17.97108 postgres-01 postgresql: LOG: aborting any active transactions
2018-01-08_11:21:17.97115 postgres-01 postgresql: FATAL: terminating connection due to administrator command
From the postgres-02 repmgr logs:
2018-01-08_11:21:19.49215 [2018-01-08 11:21:19] [ERROR] connection to database failed: FATAL: the database system is shutting down
2018-01-08_11:21:19.49225 FATAL: the database system is shutting down
2018-01-08_11:21:19.49232
2018-01-08_11:21:19.49466 [2018-01-08 11:21:19] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
2018-01-08_11:21:29.49604 [2018-01-08 11:21:29] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
2018-01-08_11:21:39.49740 [2018-01-08 11:21:39] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
2018-01-08_11:21:49.49875 [2018-01-08 11:21:49] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
2018-01-08_11:21:59.49973 [2018-01-08 11:21:59] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
2018-01-08_11:22:09.50079 [2018-01-08 11:22:09] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
2018-01-08_11:22:19.50221 [2018-01-08 11:22:19] [ERROR] unable to reconnect to master (timeout 60 seconds)...
2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best candidate to be the new master, promoting...
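To see how long postgres-01 had to come back before postgres-02 promoted itself, the timestamps in the log above can be compared. This is a rough sketch that assumes the bracketed timestamp format and the message wording shown in this excerpt; `failover_delay_seconds` is a hypothetical helper, not a repmgr tool.

```python
import re
from datetime import datetime

# Matches the bracketed repmgrd timestamp, e.g. "[2018-01-08 11:21:19]".
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")

# Three lines copied from the postgres-02 repmgr log in this issue.
SAMPLE_LOG = """\
2018-01-08_11:21:19.49466 [2018-01-08 11:21:19] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
2018-01-08_11:22:19.50221 [2018-01-08 11:22:19] [ERROR] unable to reconnect to master (timeout 60 seconds)...
2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best candidate to be the new master, promoting...
"""


def failover_delay_seconds(log_lines):
    """Seconds between the first 'connection lost' warning and the promotion notice.

    Returns None if either event is missing from the given lines.
    """
    lost_at = None
    for line in log_lines:
        match = STAMP.search(line)
        if match is None:
            continue  # line without a bracketed timestamp
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if lost_at is None and "connection to master has been lost" in line:
            lost_at = when
        elif lost_at is not None and "promoting" in line:
            return (when - lost_at).total_seconds()
    return None


if __name__ == "__main__":
    print(failover_delay_seconds(SAMPLE_LOG.splitlines()))  # 65.0
```

For the excerpt above this gives 65 seconds: the 60-second reconnect window plus about 5 seconds before the promotion started.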
I'm not entirely sure what caused postgres-01's postgres process to restart; it may have been us. We need to investigate why the automatic failover left both nodes as master, since this is very similar to what happened in production in #3512 (moved). We may also want to consider turning off automatic failover.
In the meantime, I told postgres-02 to follow postgres-01 again and things are back to the way they should be.
cc/ @jarv
Edited by Alex Hanselka