PostgreSQL Split Brain

Context

At 01:00 UTC, the PostgreSQL processes on postgres-02 and postgres-03 restarted. postgres-02 restarted as a master, and postgres-03 restarted to begin following postgres-02 as the new master. This occurred because DNS resolution for postgres-01.db.prd.gitlab.com failed. From 01:00 until 01:41, we may have been serving stale data.
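The failure mode can be illustrated with a minimal sketch (Python here purely for illustration; the real check lives inside repmgr and our failover tooling): from a standby's point of view, a transient DNS failure is indistinguishable from a dead master, which is how a node can decide to promote itself. The `.invalid` hostname below is a stand-in for the failed lookup, since `.invalid` is reserved and never resolves.

```python
import socket

def master_reachable(hostname: str, port: int = 5432) -> bool:
    """Return True if the master's hostname resolves in DNS.

    A transient DNS failure here looks the same as a dead master,
    which is how postgres-02 could decide to restart as a master.
    """
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# ".invalid" is reserved (RFC 2606) and never resolves, standing in
# for the failed lookup of postgres-01.db.prd.gitlab.com.
print(master_reachable("postgres-01.db.prd.gitlab.invalid"))  # False
```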

Timeline

Date: 2017-01-08

  • 01:00 UTC - The postgres process on postgres-02 failed to resolve postgres-01.db.prd.gitlab.com and restarted as master.
  • 01:00 UTC - The postgres process on postgres-03 restarted to begin following postgres-01 as master.
  • 01:08 UTC - The alert for replication lag fired
  • 01:15 UTC - The oncall engineer discovered that the postgres-03 process had restarted
  • 01:30 UTC - It was discovered that repmgr had gotten out of sync and we had a split-brain situation
  • 01:30 UTC - The oncall engineer called a team member to help out
  • 01:41 UTC - The deploy page was put up to avoid writes to the database
  • 01:42 UTC - postgres-02 and postgres-03 were removed from the DB load balancing
  • 01:43 UTC - The removal was deployed to the fleet
    • This was deployed by running chef-client on all the fe and be servers.
  • 01:48 UTC - We began rebuilding replication on postgres-02
    • gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com
  • 02:06 UTC - The removal finished deploying
  • 02:06 UTC - The deploy page was taken down
  • 02:44 UTC - postgres-02 finished rebuilding replication and was added back to repmgr
    • gitlab-ctl repmgr standby register
  • 02:54 UTC - Began repair of postgres-03 replication
    • gitlab-ctl repmgr standby setup postgres-01.db.prd.gitlab.com
  • 03:50 UTC - postgres-03 finished repairing and was added back to repmgr and db_load_balancing
    • gitlab-ctl repmgr standby register

Incident Analysis

  • The PagerDuty alert "Postgres Replication lag is over 2 minutes" alerted us to the issue.
  • Auto failover was enabled earlier this week, which is what led to this split-brain situation.
  • This was unrelated to the db failover this morning (#3496 (closed)).

Root Cause Analysis

What went well

What can be improved

Corrective actions

Guidelines

Edited by Alex Hanselka