Geo1 PostgreSQL master-master split-brain (#4533) · Issues · GitLab.org / GitLab

Geo1 PostgreSQL master-master split-brain

We discovered today that `postgres-01` and `postgres-02` both thought they were the master for geo1.

```
root@postgres-01.db.geo1.gitlab.com:~# gitlab-ctl repmgr cluster show
Role      | Name                           | Upstream                       | Connection String
----------+--------------------------------|--------------------------------|--------------------------------------------------------------------------------------
* master  | postgres-01.db.geo1.gitlab.com |                                | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master  | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
```

It appears to have happened after a restart of the postgres process on `postgres-01` at around 11:21 on 2018-01-08:

```
2018-01-08_11:21:17.96638 postgres-01 postgresql: received TERM from runit, sending INT instead to force quit connections
2018-01-08_11:21:17.97098 postgres-01 postgresql: LOG: received fast shutdown request
2018-01-08_11:21:17.97108 postgres-01 postgresql: LOG: aborting any active transactions
2018-01-08_11:21:17.97115 postgres-01 postgresql: FATAL: terminating connection due to administrator command
```

From the `postgres-02` repmgr logs:

```
2018-01-08_11:21:19.49215 [2018-01-08 11:21:19] [ERROR] connection to database failed: FATAL: the database system is shutting down
2018-01-08_11:21:19.49225 FATAL: the database system is shutting down
2018-01-08_11:21:19.49232
2018-01-08_11:21:19.49466 [2018-01-08 11:21:19] [WARNING] connection to master has been lost, trying to recover... 60 seconds before failover decision
2018-01-08_11:21:29.49604 [2018-01-08 11:21:29] [WARNING] connection to master has been lost, trying to recover... 50 seconds before failover decision
2018-01-08_11:21:39.49740 [2018-01-08 11:21:39] [WARNING] connection to master has been lost, trying to recover... 40 seconds before failover decision
2018-01-08_11:21:49.49875 [2018-01-08 11:21:49] [WARNING] connection to master has been lost, trying to recover... 30 seconds before failover decision
2018-01-08_11:21:59.49973 [2018-01-08 11:21:59] [WARNING] connection to master has been lost, trying to recover... 20 seconds before failover decision
2018-01-08_11:22:09.50079 [2018-01-08 11:22:09] [WARNING] connection to master has been lost, trying to recover... 10 seconds before failover decision
2018-01-08_11:22:19.50221 [2018-01-08 11:22:19] [ERROR] unable to reconnect to master (timeout 60 seconds)...
2018-01-08_11:22:24.56389 [2018-01-08 11:22:24] [NOTICE] this node is the best candidate to be the new master, promoting...
```

I'm not entirely sure what caused the restart of `postgres-01`'s postgres process; it may have been us.

We need to investigate why the automatic failover failed, as this is very similar to what happened in production in gitlab-com/infrastructure#3512. We may also want to consider turning off automatic failover.

In the meantime, I told `postgres-02` to follow `postgres-01` again, and things are back to the way they should be.

cc/ @jarv
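As a possible aid for catching this state earlier, here is a minimal sketch that scans `gitlab-ctl repmgr cluster show` output for more than one node reporting the master role. It assumes the four-column, pipe-separated layout shown in the output above; other repmgr versions may format this differently. The names `find_masters` and `SAMPLE` are hypothetical and not part of repmgr or GitLab.

```python
from typing import List

# Sample taken from the cluster show output in this issue (whitespace condensed).
SAMPLE = """\
Role | Name | Upstream | Connection String
----------+------------|------------|------------
* master | postgres-01.db.geo1.gitlab.com | | host=postgres-01.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
* master | postgres-02.db.geo1.gitlab.com | postgres-01.db.geo1.gitlab.com | host=postgres-02.db.geo1.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr
"""


def find_masters(cluster_show_output: str) -> List[str]:
    """Return the names of all nodes whose Role column reads 'master'."""
    masters = []
    for line in cluster_show_output.splitlines():
        # Assumed layout: Role | Name | Upstream | Connection String
        columns = [col.strip() for col in line.split("|")]
        if len(columns) < 4:
            # Skips the shell prompt, the separator row, and blank lines.
            continue
        role, name = columns[0], columns[1]
        # The role appears as "* master" above; drop the leading marker.
        if role.lstrip("* ").lower() == "master":
            masters.append(name)
    return masters


if __name__ == "__main__":
    masters = find_masters(SAMPLE)
    if len(masters) > 1:
        print("possible split-brain, multiple masters: " + ", ".join(masters))
```

A check like this only reports what the local repmgr metadata says, so it would need to be run against each node to be meaningful.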
